
Copy files from Amazon S3 to HDFS using s3distcp fails

I am trying to copy files from S3 to HDFS using a workflow in EMR. When I run the command below, the jobflow starts successfully but fails with an error when it tries to copy the file to HDFS. Do I need to set any input file permissions?

Command:

./elastic-mapreduce --jobflow j-35D6JOYEDCELA --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3://odsh/input/,--dest,hdfs:///Users'

Output:

Task TASKID="task_201301310606_0001_r_000000" TASK_TYPE="REDUCE" TASK_STATUS="FAILED" FINISH_TIME="1359612576612" ERROR="java.lang.RuntimeException: Reducer task failed to copy 1 files: s3://odsh/input/GL_01112_20121019.dat etc
    at com.amazon.external.elasticmapreduce.s3distcp.CopyFilesReducer.close(CopyFilesReducer.java:70)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:538)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:429)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)"

I'm getting the same exception. It looks like the bug is caused by a race condition: CopyFilesReducer uses multiple CopyFilesRunable instances to download files from S3, they all use the same temp directory, and each thread deletes that directory when it's done. Hence, when one thread completes before another, it deletes the temp directory that the other thread is still using.
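A minimal sketch of that failure mode in plain shell (not the actual s3distcp code; the file names and sleep times are only illustrative): two workers share one temp directory and each removes it when it finishes, so the slower worker finds the directory gone mid-copy.

TMP=$(mktemp -d)
worker() {
    sleep "$1"
    cp "$2" "$TMP/" || echo "worker $1: copy failed, shared temp dir already deleted"
    rm -rf "$TMP"    # each worker cleans up the *shared* directory when it is done
}
worker 1 /etc/hostname &
worker 3 /etc/hosts &
wait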

I've reported the problem to AWS, but in the meantime you can work around the bug by forcing the reducer to use a single thread: set the variable s3DistCp.copyfiles.mapper.numWorkers to 1 in your job config.

I see the same problem, caused by the race condition. Passing -Ds3DistCp.copyfiles.mapper.numWorkers=1 helps avoid the problem.
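For reference, a sketch of how that option might be attached to the command from the question (same jobflow ID and paths as above; whether your s3distcp build accepts the -D option ahead of --src is an assumption worth verifying):

./elastic-mapreduce --jobflow j-35D6JOYEDCELA \
    --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
    --args '-Ds3DistCp.copyfiles.mapper.numWorkers=1,--src,s3://odsh/input/,--dest,hdfs:///Users'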

I hope Amazon fixes this bug.

Adjusting the number of workers didn't work for me; s3distcp always failed on a small/medium instance. Increasing the heap size of the task JVM (via -D mapred.child.java.opts=-Xmx1024m) solved it for me.

Example usage:

hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
    -D mapred.child.java.opts=-Xmx1024m \
    --src s3://source/ \
    --dest hdfs:///dest/ --targetSize 128 \
    --groupBy '.*\.([0-9]+-[0-9]+-[0-9]+)-[0-9]+\..*' \
    --outputCodec gzip

The problem is that the map-reduce jobs fail. The mappers execute perfectly, but the reducers create a bottleneck in the cluster's memory.

THIS SOLVED it for me: -Dmapreduce.job.reduces=30. If it still fails, try reducing it to 20, i.e. -Dmapreduce.job.reduces=20.

I'll add the entire argument list for ease of understanding:

In the AWS cluster (EMR step):

JAR location: command-runner.jar

Main class: None

Arguments: s3-dist-cp -Dmapreduce.job.reduces=30 --src=hdfs:///user/ec2-user/riskmodel-output --dest=s3://dev-quant-risk-model/2019_03_30_SOM_EZ_23Factors_Constrained_CSR_Stats/output --multipartUploadChunkSize=1000

Action on failure: Continue

In a script file:

aws --profile $AWS_PROFILE emr add-steps --cluster-id $CLUSTER_ID --steps Type=CUSTOM_JAR,Jar='command-runner.jar',Name="Copy Model Output To S3",ActionOnFailure=CONTINUE,Args=[s3-dist-cp,-Dmapreduce.job.reduces=20,--src=$OUTPUT_BUCKET,--dest=$S3_OUTPUT_LARGEBUCKET,--multipartUploadChunkSize=1000]
