Apache Parquet is a columnar storage file format designed to support complex data processing.
Apache Parquet is a self-describing data format: it embeds the schema, or structure, within the data itself. The result is a file optimized for query performance and minimal I/O. Specifically, it has the following characteristics:
- Apache Parquet is column-oriented and designed to store data more efficiently than row-based files like CSV
- Apache Parquet is built from the ground up with complex nested data structures in mind
- Apache Parquet is built to support very efficient compression and encoding schemes (for example, Google's Snappy)
- Apache Parquet lowers storage costs for data files and makes querying data more effective with serverless technologies like Amazon Athena, Redshift Spectrum, BigQuery, and Azure Data Lakes.
- Apache Parquet is open source: it is licensed under the Apache License, maintained as a project of the Apache Software Foundation, and available to any project.
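
To make the column-oriented and encoding points above more concrete, here is a minimal sketch in plain Python. It does not read or write real Parquet files and is not Parquet's actual on-disk layout; the sample records and the helper names `to_columns` and `dictionary_encode` are made up for illustration. It only shows the general idea: once values are grouped by column, a query can touch a single column, and repeated values in a column can be replaced with small indices.

```python
# Hypothetical sample records. A row-based file such as CSV stores them
# one record at a time.
rows = [
    {"id": 1, "country": "DE", "amount": 10.5},
    {"id": 2, "country": "DE", "amount": 3.0},
    {"id": 3, "country": "US", "amount": 7.25},
    {"id": 4, "country": "US", "amount": 1.0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [r[name] for r in records] for name in records[0]}


def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    positions = {}    # value -> its index in `dictionary`
    indices = []      # one small integer per original value
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return dictionary, indices


columns = to_columns(rows)

# A query that needs only "amount" reads one list and skips the others.
total = sum(columns["amount"])

# Repeated values in a column shrink to a short dictionary plus indices.
dictionary, indices = dictionary_encode(columns["country"])

print(total)                # 21.75
print(dictionary, indices)  # ['DE', 'US'] [0, 0, 1, 1]
```

In a real project you would use a Parquet library rather than code like this; the sketch is only meant to show why grouping values by column helps both selective reads and compression.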