How to avoid AWS Athena CTAS query creating small files?

I can't figure out what is wrong with my CTAS query: it splits the data into several small files within each partition, even though I haven't specified any bucketing columns. Is there a way to avoid these small files and store a single file per partition? Files smaller than 128 MB cause additional overhead.

CREATE TABLE sampledb.yellow_trip_data_parquet
WITH(
    format = 'PARQUET',
    parquet_compression = 'GZIP',
    external_location='s3://mybucket/Athena/tables/parquet/',
    partitioned_by=ARRAY['year','month']
)
AS SELECT
    VendorID,
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    passenger_count,
    trip_distance,
    RatecodeID,
    store_and_fwd_flag,
    PULocationID,
    DOLocationID,
    payment_type,
    fare_amount,
    extra,
    mta_tax,
    tip_amount,
    tolls_amount,
    improvement_surcharge,
    total_amount,
    date_format(date_parse(tpep_pickup_datetime,'%Y-%c-%d %k:%i:%s'),'%Y')  AS year,
    date_format(date_parse(tpep_pickup_datetime,'%Y-%c-%d %k:%i:%s'),'%c')  AS month
FROM sampleDB.yellow_trip_data_raw;

[Image: files in one of my partitions]

I was able to overcome the issue by adding a bucketing column, month_a. Below is the code:

CREATE TABLE sampledb.yellow_trip_data_avro
WITH (
    format = 'AVRO',
    external_location='s3://a4189e1npss3001/Athena/internal_tables/avro/',
    partitioned_by=ARRAY['year','month'],
    bucketed_by=ARRAY['month_a'],
    bucket_count=12
) AS SELECT
    VendorID,
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    passenger_count,
    trip_distance,
    RatecodeID,
    store_and_fwd_flag,
    PULocationID,
    DOLocationID,
    payment_type,
    fare_amount,
    extra,
    mta_tax,
    tip_amount,
    tolls_amount,
    improvement_surcharge,
    total_amount,
    date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%c') AS month_a,
    date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%Y') AS year,
    date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%c') AS month
FROM sampleDB.yellow_trip_data_raw;

Athena is a distributed system, and it scales the execution of your query by a mechanism you can't observe. It looks like it decided to use five workers for your CTAS query, which results in five files in each partition.
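
To check how many files a CTAS run actually wrote per partition, one option is to count distinct values of Athena's hidden "$path" column. This is a minimal sketch that assumes the table from the question exists and is queryable; the file_count alias is only illustrative.

```sql
-- Count the distinct S3 objects backing each partition.
-- "$path" is a hidden column holding the S3 object each row was read from.
SELECT
    year,
    month,
    count(DISTINCT "$path") AS file_count
FROM sampledb.yellow_trip_data_parquet
GROUP BY year, month
ORDER BY year, month;
```

A file_count above 1 for a partition means more than one writer contributed to it.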

You could try explicitly specifying a bucket count of one (bucket_count = 1), but if I remember correctly, you might still get multiple files.
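
A rough sketch of that idea, not a tested solution: the table name and S3 location below are hypothetical, and bucketing on vendorid is an assumption (any non-partition column should do, since a single bucket receives every row). The column list is shortened for readability; in practice you would list all columns as in the question.

```sql
-- Hypothetical table name and S3 location; adjust to your environment.
CREATE TABLE sampledb.yellow_trip_data_parquet_one_bucket
WITH (
    format = 'PARQUET',
    parquet_compression = 'GZIP',
    external_location = 's3://mybucket/Athena/tables/parquet_one_bucket/',
    partitioned_by = ARRAY['year', 'month'],
    bucketed_by = ARRAY['vendorid'],  -- assumption: any non-partition column
    bucket_count = 1                  -- ask for a single bucket per partition
)
AS SELECT
    VendorID,
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    -- Partition columns must come last in the SELECT list.
    date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'), '%Y') AS year,
    date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'), '%c') AS month
FROM sampledb.yellow_trip_data_raw;
```

Whether this really yields one file per partition is worth verifying on your own data, since Athena may still split the output.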
