How can I optimize the read from S3?

dyf_pagewise_word_count = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="csv",
    connection_options={
        "paths": ["s3://somefile.csv/"],
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "100000"
    },
    format_options={
        "withHeader": True,
        "separator": ","
    }
)

It takes 45 seconds to read from S3. Is there any way to optimize the read time?

You could try the optimizePerformance format option if you're using Glue 3.0. It enables a vectorized CSV reader that batches records into a columnar in-memory format, reducing I/O overhead. See the AWS Glue documentation on the CSV format for more details.

dyf_pagewise_word_count = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="csv",
    connection_options={
        "paths": ["s3://somefile.csv/"],
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "100000"
    },
    format_options={
        "withHeader": True,
        "separator": ",",
        "optimizePerformance": True
    }
)

Also, could you convert the CSV to something like Parquet upstream of the read? Parquet is columnar and compressed, so Glue can usually scan it much faster than CSV. A sketch of such a conversion step is below.
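Here's a minimal sketch of a one-off conversion job, assuming the same glueContext as in your snippet; the output path s3://converted-output/ is a hypothetical placeholder you'd replace with your own bucket:

dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="csv",
    connection_options={"paths": ["s3://somefile.csv/"], "recurse": True},
    format_options={"withHeader": True, "separator": ","},
)

# Write the same data back out as Parquet; downstream jobs then read
# this copy instead of the CSV.
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://converted-output/"},  # hypothetical output location
    format="parquet",
)

You'd pay the CSV read cost once here, and every subsequent read against the Parquet copy should be faster.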
