Performance issue with AWS EMR S3DistCp

I am using S3DistCp on an EMR cluster to aggregate around 200K small files (3.4GB in total) from one path in an S3 bucket to another path in the same bucket. It works, but it is extremely slow (around 600MB transferred after more than 20 minutes).

Here is my EMR configuration:

1 master node, m5.xlarge
3 core nodes, m5.xlarge
release label 5.29.0

The command:

s3-dist-cp --s3Endpoint=s3-eu-central-1.amazonaws.com --src=s3://my-bucket/input/ --dest=s3://my-bucket/output/ --groupBy=.*input/(entry).*(.json.gz) --targetSize=128

Am I missing something? I have read that S3DistCp can transfer a lot of files very quickly, but I can't figure out how. Both the EMR cluster and the bucket are in the same region, by the way.

Thank you.

Here are my recommendations:

  1. Use R-type instances. They provide more memory than M-type instances.
  2. Use coalesce to merge the source files, since you have many small files.
  3. Check the number of mapper tasks. The more tasks there are, the lower the performance.
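
To put the numbers from the question in perspective, here is a rough back-of-the-envelope sketch in Python. It only uses the figures quoted in the question (200K files, 3.4GB, 600MB after 20 minutes, --targetSize=128). The function name `summarize`, the 1GB = 1024MB conversion, the constant copy rate, and the idea that all files end up in a single group are assumptions made for illustration, not something S3DistCp reports.

```python
import math


def summarize(total_files, total_mb, copied_mb, elapsed_s, target_mb):
    """Back-of-the-envelope figures for a small-file copy job.

    All inputs are taken from the question; the result is only an estimate.
    """
    avg_file_kb = total_mb * 1024 / total_files        # average source file size
    rate_mb_s = copied_mb / elapsed_s                  # observed copy rate
    projected_minutes = total_mb / rate_mb_s / 60      # if the rate stays constant
    files_per_s = (copied_mb * 1024 / avg_file_kb) / elapsed_s
    # Lower bound on output files, assuming every file lands in one group
    # that is split at roughly target_mb (an assumption for illustration).
    min_output_files = math.ceil(total_mb / target_mb)
    return {
        "avg_file_kb": avg_file_kb,
        "rate_mb_s": rate_mb_s,
        "projected_minutes": projected_minutes,
        "files_per_s": files_per_s,
        "min_output_files": min_output_files,
    }


if __name__ == "__main__":
    stats = summarize(
        total_files=200_000,
        total_mb=3.4 * 1024,   # 3.4GB, assuming 1GB = 1024MB
        copied_mb=600,
        elapsed_s=20 * 60,     # "more than 20 minutes", so the rate is an upper bound
        target_mb=128,
    )
    for name, value in stats.items():
        print(f"{name}: {value:.2f}")
```

With these inputs the average file is under 20KB and the observed rate is about 0.5MB/s, so the job is probably dominated by per-file overhead rather than by raw bandwidth, which is why the number of small files and of tasks matters more than the total size.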
