
Are parquet files splittable when stored in AWS S3?

  • I know that Parquet files are splittable if they are stored in block storage, e.g. on HDFS.
  • Are they also splittable when stored in object storage such as AWS S3?
  • This confuses me because object storage is supposed to be atomic: you either access the entire file or none of it. You can't even change the metadata on an S3 file without rewriting the entire file. On the other hand, AWS recommends using splittable file formats in S3 to improve the performance of Athena and other frameworks in the Hadoop ecosystem.

Yes, Parquet files are splittable.

S3 supports positioned reads (range requests), which can be used to read only selected portions of the input file (object).
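
As a rough sketch of what a reader can do with range requests (not how any particular engine is implemented): a Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes PAR1, so a reader can fetch the last 8 bytes, work out where the footer metadata sits, and then request only that range. The helper names range_header and footer_range are made up for illustration, and the code only builds Range header values; it does not talk to S3.

```python
import struct

PARQUET_MAGIC = b"PAR1"  # magic bytes that end every Parquet file


def range_header(start: int, length: int) -> str:
    """Build an HTTP Range header value for `length` bytes starting at `start`.

    HTTP byte ranges are inclusive on both ends, hence the -1.
    """
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    return f"bytes={start}-{start + length - 1}"


def footer_range(tail: bytes, object_size: int) -> str:
    """Return the Range header value covering the Parquet footer metadata.

    `tail` is assumed to be the last 8 bytes of the object:
    a 4-byte little-endian footer length, then the magic bytes.
    """
    if len(tail) != 8 or tail[4:] != PARQUET_MAGIC:
        raise ValueError("these bytes are not the tail of a Parquet file")
    (footer_len,) = struct.unpack("<I", tail[:4])
    start = object_size - 8 - footer_len
    if start < 0:
        raise ValueError("footer length is larger than the object")
    return range_header(start, footer_len)


if __name__ == "__main__":
    # Hypothetical object: 1000 bytes in total, with a 100-byte footer.
    tail = struct.pack("<I", 100) + PARQUET_MAGIC
    print(footer_range(tail, 1000))  # bytes=892-991
```

With boto3, a value like this can be passed as the Range argument of get_object, e.g. s3.get_object(Bucket=..., Key=..., Range="bytes=892-991"). The footer lists the row groups and their offsets, so each worker can then fetch only the row groups assigned to it instead of the whole object.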

I'm not 100% sure what you mean here, but generally (I think) Parquet data is partitioned on the partition keys, and the columns are saved in blocks of rows. When I have used it with AWS S3, it has been saved like this:

|-Folder
|--Partition Keys
|---Columns
|----Rows_1-100.snappy.parquet
|----Rows_101-200.snappy.parquet

This layout handles the splitting efficiency you mention.
