Spark, Hadoop, Hive and Programming: Apache Parquet File Format

Thursday, July 4, 2019

Apache Parquet File Format

Apache Parquet is a columnar storage file format designed to support complex data processing.

Apache Parquet is a self-describing data format that embeds the schema, or structure, within the data itself. Combined with its columnar layout, this results in a file that is optimized for query performance and minimal I/O. Specifically, it has the following characteristics:

Apache Parquet is column-oriented and designed to store data more efficiently than row-based files like CSV.
Apache Parquet is built from the ground up with complex nested data structures in mind.
Apache Parquet is built to support very efficient compression and encoding schemes (see Google Snappy).
Apache Parquet lowers storage costs for data files and maximizes the effectiveness of querying data with serverless technologies like Amazon Athena, Redshift Spectrum, BigQuery, and Azure Data Lake.
Apache Parquet is an Apache Software Foundation project, released under the Apache License and available to any project.
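
To make the column-oriented idea above more concrete, here is a minimal sketch in plain Python. It is not the Parquet on-disk format and does not use any Parquet library: the sample data, the function names, and the use of JSON plus zlib (as a stand-in for Parquet's encodings and Snappy) are all assumptions made for illustration. It only shows why storing each column contiguously lets a reader fetch a single column without touching the rest.

```python
import json
import zlib

# Hypothetical sample data: 1,000 records with a highly repetitive "country" column.
rows = [
    {"id": i, "country": ["US", "DE", "IN"][i % 3], "amount": i * 0.5}
    for i in range(1000)
]


def to_row_layout(records):
    """Serialize record by record, roughly how a CSV or JSON-lines file is laid out."""
    return "\n".join(json.dumps(r) for r in records).encode("utf-8")


def to_column_layout(records):
    """Pivot the records so that each column's values sit next to each other."""
    return {name: [r[name] for r in records] for name in records[0]}


def compress_columns(columns):
    """Compress every column independently (zlib here is a stand-in for Snappy)."""
    return {
        name: zlib.compress(json.dumps(values).encode("utf-8"))
        for name, values in columns.items()
    }


def read_column(chunks, name):
    """Decompress only the requested column; the other chunks are never read."""
    return json.loads(zlib.decompress(chunks[name]))


row_bytes = zlib.compress(to_row_layout(rows))
column_chunks = compress_columns(to_column_layout(rows))

# A query that needs only "country" decompresses one small chunk
# instead of scanning every full record.
countries = read_column(column_chunks, "country")

print("row layout, compressed bytes:", len(row_bytes))
for name, chunk in column_chunks.items():
    print(f"column {name!r}, compressed bytes:", len(chunk))
```

In this toy layout a query over `country` reads a single chunk, which is the same intuition behind reading only the needed columns from a Parquet file. Real Parquet files add much more on top of this, such as typed encodings and schema metadata stored in the file itself.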

