Thursday, July 4, 2019

Apache Parquet File Format

Apache Parquet is a columnar storage file format designed to support complex data processing.
Apache Parquet is a self-describing data format: the schema, or structure, is embedded within the file itself. Combined with the columnar layout, this results in a file that is optimized for query performance and minimal I/O. Specifically, it has the following characteristics:
  • Apache Parquet is column-oriented and designed to store data more efficiently than row-based files like CSV
  • Apache Parquet is built from the ground up with complex nested data structures in mind
  • Apache Parquet is built to support very efficient compression and encoding schemes (for example, Google Snappy)
  • Apache Parquet lowers storage costs for data files and maximizes the effectiveness of querying data with serverless technologies like Amazon Athena, Redshift Spectrum, BigQuery, and Azure Data Lake
  • Apache Parquet is released under the Apache License by the Apache Software Foundation and is available to any project
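
To make the column-oriented idea above more concrete, here is a minimal sketch in plain Python. It is not how Parquet actually encodes data: the sample records are made up, the text-based "chunks" are a simplification, and zlib from the standard library stands in for a codec such as Snappy. It only illustrates why keeping the values of one column together helps both compression and reads that need a single column.

```python
import zlib

# Hypothetical sample records (not from any real dataset).
rows = [
    {"id": i, "country": "US" if i % 2 == 0 else "DE", "amount": i * 1.5}
    for i in range(1000)
]

# Row-oriented layout (like CSV): the values of one record sit next to each other.
row_bytes = "\n".join(
    f"{r['id']},{r['country']},{r['amount']}" for r in rows
).encode()

# Column-oriented layout: all values of one column sit next to each other.
columns = {name: [r[name] for r in rows] for name in rows[0]}
column_chunks = {
    name: "\n".join(str(v) for v in values).encode()
    for name, values in columns.items()
}

# Similar values stored together tend to compress well.
# zlib is only a stand-in here for a codec like Snappy.
compressed_chunks = {
    name: zlib.compress(chunk) for name, chunk in column_chunks.items()
}

# A query that needs only "amount" reads one chunk instead of every row.
amount_chunk = column_chunks["amount"]
total_amount = sum(float(v) for v in amount_chunk.decode().split("\n"))

print("row layout bytes:", len(row_bytes))
for name, chunk in column_chunks.items():
    print(name, "raw:", len(chunk), "compressed:", len(compressed_chunks[name]))
print("bytes read for the amount query:", len(amount_chunk))
```

A real Parquet file adds much more on top of this idea, such as row groups, typed binary encodings, and the embedded schema, so treat the sketch as an intuition aid rather than a description of the format.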
