The DynamoDB table backup using the AWS Data Pipeline process failed with the following error:
02 May 2017 07:19:04,544 [WARN] (TaskRunnerService-df-0940986HJGYQM1ZJ8BN_@EmrClusterForBackup_2017-04-25T13:31:55-2) df-0940986HJGYQM1ZJ8BN amazonaws.datapipeline.cluster.EmrUtil: EMR job flow named 'df-0940986HJGYQM1ZJ8BN_@EmrClusterForBackup_2017-04-25T13:31:55' with jobFlowId 'j-2SJ0OQOM0BTI' is in status 'RUNNING' because of the step 'df-0940986HJGYQM1ZJ8BN_@TableBackupActivity_2017-04-25T13:31:55_Attempt=2' failures 'null'
02 May 2017 07:19:04,544 [INFO] (TaskRunnerService-df-0940986HJGYQM1ZJ8BN_@EmrClusterForBackup_2017-04-25T13:31:55-2) df-0940986HJGYQM1ZJ8BN amazonaws.datapipeline.cluster.EmrUtil: EMR job '@TableBackupActivity_2017-04-25T13:31:55_Attempt=2' with jobFlowId 'j-2SJ0OQOM0BTI' is in status 'RUNNING' and reason 'Running step'. Step 'df-0940986HJGYQM1ZJ8BN_@TableBackupActivity_2017-04-25T13:31:55_Attempt=2' is in status 'FAILED' with reason 'null'
02 May 2017 07:19:04,544 [INFO] (TaskRunnerService-df-0940986HJGYQM1ZJ8BN_@EmrClusterForBackup_2017-04-25T13:31:55-2) df-0940986HJGYQM1ZJ8BN amazonaws.datapipeline.cluster.EmrUtil: Collecting steps stderr logs for cluster with AMI 3.9.0
02 May 2017 07:19:04,558 [INFO] (TaskRunnerService-df-0940986HJGYQM1ZJ8BN_@EmrClusterForBackup_2017-04-25T13:31:55-2) df-0940986HJGYQM1ZJ8BN amazonaws.datapipeline.taskrunner.LogMessageUtil: Returning tail errorMsg : at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:460)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:343)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
at org.apache.hadoop.dynamodb.tools.DynamoDbExport.run(DynamoDbExport.java:79)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.dynamodb.tools.DynamoDbExport.main(DynamoDbExport.java:30)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
There is a large amount of data (6 million items). The pipeline ran for 4 days and then failed. I can't figure out the cause of the error.
From analyzing your logs, and particularly this line...
org.apache.hadoop.dynamodb.tools.DynamoDbExport
...it seems you are running an AWS Data Pipeline created from one of the pre-defined templates, named "Export DynamoDB table to S3".
This Data Pipeline takes several input parameters that you can edit in the pipeline architect. The most important ones are the name of the source DynamoDB table and the output S3 location, myOutputS3Loc, under which the export is written to a timestamped subfolder, e.g.:
s3://<S3_BUCKET_NAME>/<S3_BUCKET_PREFIX>/2019-08-13-15-32-02
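For reference, the parameter-values section of a pipeline built from this template looks roughly like the following (parameter names follow the AWS template; the table name, bucket, and region values here are placeholders, not taken from your pipeline):

```json
{
  "values": {
    "myDDBTableName": "my-table",
    "myOutputS3Loc": "s3://<S3_BUCKET_NAME>/<S3_BUCKET_PREFIX>/",
    "myDDBReadThroughputRatio": "0.25",
    "myDDBRegion": "us-east-1"
  }
}
```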
Now that we are familiar with the parameters this data pipeline needs, let's look at this log line:
02 May 2017 07:19:04,558 [INFO] (TaskRunnerService-df-0940986HJGYQM1ZJ8BN_@EmrClusterForBackup_2017-04-25T13:31:55-2) df-0940986HJGYQM1ZJ8BN amazonaws.datapipeline.taskrunner.LogMessageUtil: Returning tail errorMsg : at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
Although it didn't bubble up at the ERROR log level, this message comes from the Hadoop framework failing to validate the output location for the Hadoop job (FileOutputFormat.checkOutputSpecs). When your Data Pipeline submitted the task to Hadoop's TaskRunner, Hadoop evaluated the output location and rejected it. This can mean multiple things:
- the output location parameter is empty or not a valid S3 URI;
- the target S3 bucket does not exist or is not accessible to the EMR cluster;
- the computed output directory already exists (note that your failing step is Attempt=2, so a retry may be colliding with output left behind by a previous attempt).
I would suggest inspecting the myOutputS3Loc parameter and the values passed to it to make sure your MR job is getting the right inputs. You can also verify which parameters were submitted to the EMR task by inspecting the controller logs in the EMR console while your job is running.
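As a quick local sanity check before re-running the pipeline, you can verify that the value you pass for myOutputS3Loc is at least a well-formed S3 URI. This is a minimal, hypothetical helper (the function name is mine, not part of any AWS SDK); it only catches formatting mistakes, not permission or bucket-existence problems:

```python
from urllib.parse import urlparse

def check_s3_output_loc(uri):
    """Return a list of formatting problems with an S3 output location."""
    problems = []
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        problems.append("URI must start with s3://")
    if not parsed.netloc:
        problems.append("bucket name is missing")
    return problems

print(check_s3_output_loc("s3://my-bucket/backups/"))  # well-formed location
print(check_s3_output_loc("my-bucket/backups/"))       # missing the s3:// scheme
```

If this reports no problems but the job still fails, the next thing to check is whether the timestamped output folder already exists in S3 from an earlier attempt.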