AWS Athena | CSV vs Parquet | size of data scanned

TL;DR: In Athena, selecting the first 10 rows (limit 10) scans more data for the Parquet format than for the CSV format. Shouldn't it be the other way round?

I am using Athena (V1) to query the following two datasets (same data, but in two different file formats):

| Format  | Size    | Athena DB name | Athena table name    | Dataset description |
|---------|---------|----------------|----------------------|---------------------|
| CSV     | 91.3 MB | nycitytaxi     | data                 | NYC taxi trip data, present in a public S3 bucket |
| Parquet | 19.4 MB | nycitytaxi     | aws_glue_result_xxxx | Same data as above, converted to Parquet (through a Glue Crawler job) and stored in one of my S3 buckets |

Now I am executing the following query on both the tables:

select lpep_pickup_datetime, lpep_dropoff_datetime 
from nycitytaxi.<table_name>
limit 10

On executing this query on the CSV-based table (table_name: data), the Athena console shows it scanned 721.96 KB of data.

On executing this query on the Parquet-based table (table_name: aws_glue_result_xxxx), the Athena console shows it scanned 10.9 MB of data.

Shouldn't Athena be scanning far less data for the Parquet-based table, since Parquet is a columnar format, as opposed to the row-based storage of CSV?

It is due to your specific query.

select lpep_pickup_datetime, lpep_dropoff_datetime 
from nycitytaxi.<table_name>
limit 10

In row-based formats like CSV, all data is stored row by row. So when you ask for any 10 rows, the engine can start reading the CSV file from the beginning and stop once it has the first 10 rows, which results in a very small data scan.

In columnar data formats like Parquet, the records are stored column by column. Let us assume the data has three columns, say id, name and number. All id values are stored together, all name values are stored together, and all number values are stored together. So when you select 10 rows from Parquet, I have to fetch 10 values from each selected column, and those values sit in different storage locations. The values are also stored in compressed blocks, so getting 10 of them generally means reading the whole block they belong to, plus the file metadata, rather than just the first few bytes. Which means I have to scan more.
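A rough way to see the effect is a toy model in Python. This is only a sketch under loud assumptions: the data is synthetic, the "columnar file" is just one zlib-compressed chunk per column (real Parquet uses row groups and pages, and Athena's own accounting of scanned bytes will differ), and all function names here are made up for the illustration.

```python
import io
import zlib

# Synthetic data, invented for this sketch (not the NYC taxi dataset).
COLUMNS = ["id", "pickup", "dropoff", "distance", "fare", "note"]
NUM_ROWS = 50_000


def clock(seconds):
    """Format a second-of-day offset as a timestamp string."""
    s = seconds % 86400
    return f"2017-01-01 {s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"


def make_rows(n):
    """Build n records as lists of strings (no commas inside any value)."""
    return [
        [str(i), clock(i * 37), clock(i * 37 + 600),
         f"{(i % 500) / 10:.1f}", f"{(i % 900) / 7:.2f}", f"note-{i % 1000}"]
        for i in range(n)
    ]


def build_row_file(rows):
    """Row layout: header line plus one comma-separated line per record."""
    lines = [",".join(COLUMNS)] + [",".join(r) for r in rows]
    return ("\n".join(lines) + "\n").encode("utf-8")


def build_column_file(rows):
    """Toy columnar layout: one zlib-compressed chunk per column."""
    return {
        name: zlib.compress("\n".join(r[i] for r in rows).encode("utf-8"))
        for i, name in enumerate(COLUMNS)
    }


def limit_from_rows(data, wanted, limit):
    """Read lines from the start and stop after `limit` records."""
    stream = io.BytesIO(data)
    header = stream.readline().decode("utf-8").rstrip("\n").split(",")
    picks = [header.index(c) for c in wanted]
    result = []
    for _ in range(limit):
        fields = stream.readline().decode("utf-8").rstrip("\n").split(",")
        result.append([fields[p] for p in picks])
    return result, stream.tell()  # bytes consumed so far


def limit_from_columns(chunks, wanted, limit):
    """Decompress each wanted column chunk in full, then keep `limit` values."""
    bytes_read = 0
    columns = []
    for name in wanted:
        bytes_read += len(chunks[name])
        values = zlib.decompress(chunks[name]).decode("utf-8").split("\n")
        columns.append(values[:limit])
    return [list(t) for t in zip(*columns)], bytes_read


rows = make_rows(NUM_ROWS)
row_file = build_row_file(rows)
column_file = build_column_file(rows)
wanted = ["pickup", "dropoff"]

row_result, row_bytes = limit_from_rows(row_file, wanted, 10)
col_result, col_bytes = limit_from_columns(column_file, wanted, 10)

print("row layout, limit 10:     ", row_bytes, "bytes read")
print("column layout, limit 10:  ", col_bytes, "bytes read")
# A query that needs every value of the two columns reads the whole row file,
# but still only the same two chunks in the toy columnar layout.
print("row layout, full scan:    ", len(row_file), "bytes read")
print("column layout, full scan: ", col_bytes, "bytes read")
```

In this model the limit 10 case favours the row layout, while a query that has to look at every value of the two columns favours the columnar one, which is where Parquet is expected to pay off.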

More on Parquet pros and cons here.
