AWS EMR cluster streaming step: Bad Request

Question

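I am trying to set up a simple EMR job that performs a word count over a large number of text files stored in s3://__mybucket__/input/. I cannot get the first of the two required streaming steps added correctly. The first step maps the input with wordSplitter.py and reduces with IdentityReducer into temporary storage; the second step maps the contents of that secondary storage with /bin/wc/ and reduces with IdentityReducer again.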
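For readers unfamiliar with Hadoop streaming mappers, here is a minimal sketch of what a word-splitting mapper along the lines of wordSplitter.py might look like. It is not the actual AWS sample script; the word pattern and the `word<TAB>1` record format are assumptions made for illustration, and `map_words` and `main` are hypothetical names.

```python
import re
import sys

# Assumption: a "word" is a run of letters/digits that starts with a letter.
WORD_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def map_words(lines):
    """Yield one 'word<TAB>1' record for every word found in the input lines."""
    for line in lines:
        for word in WORD_PATTERN.findall(line):
            yield f"{word.lower()}\t1"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Streaming entry point: read lines from stdin, write records to stdout."""
    for record in map_words(stdin):
        stdout.write(record + "\n")

# When saved as a mapper script, call main() at the bottom of the file.
```

Hadoop streaming feeds each input split to the mapper on standard input and collects whatever it writes to standard output, which is why the sketch only deals with text streams.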

Here is the (failed) description of the first step:

Status:FAILED
Reason:S3 Service Error.
Log File:s3://aws-logs-209733341386-us-east-1/elasticmapreduce/j-2XC5AT2ZP48FJ/steps/s-1SML7U7CXRDT5/stderr.gz
Details:Exception in thread "main" com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: 7799087FCAE73457), S3 Extended Request ID: nQYTtW93TXvi1G8U4LLj73V1xyruzre+uSt4KN1zwuIQpwDwa+J8IujOeQMpV5vRHmbuKZLasgs=
JAR location: command-runner.jar
Main class: None
Arguments: hadoop-streaming -files s3://elasticmapreduce/samples/wordcount/wordSplitter.py -mapper wordSplitter.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer -input s3://__mybucket__/input/ -output s3://__mybucket__/output/
Action on failure: Continue

Here is the command sent to the Hadoop cluster:

JAR location : command-runner.jar
Main class : None
Arguments : hadoop-streaming -mapper s3a://elasticmapreduce/samples/wordcount/wordSplitter.py -reducer aggregate -input s3a://__my_bucket__/input/ -output s3a://__my_bucket__/output/
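The fields in the step description above (JAR location, Arguments, Action on failure) correspond to the step definition that EMR accepts programmatically. The following sketch builds such a definition as a plain Python dictionary; `build_streaming_step` is a hypothetical helper, and the choice to ship the mapper with `-files` and refer to it by file name is an assumption based on the failed step's arguments.

```python
def build_streaming_step(name, mapper_uri, reducer, input_uri, output_uri):
    """Build an EMR step definition for a Hadoop streaming job.

    The mapper script is distributed with -files and then referenced
    by its bare file name, as in the failed step's argument list.
    """
    mapper_file = mapper_uri.rsplit("/", 1)[-1]
    args = [
        "hadoop-streaming",
        "-files", mapper_uri,
        "-mapper", mapper_file,
        "-reducer", reducer,
        "-input", input_uri,
        "-output", output_uri,
    ]
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # matches "Action on failure: Continue"
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }
```

A dictionary of this shape could then be passed in the `Steps` list of boto3's EMR client method `add_job_flow_steps`, together with the cluster's `JobFlowId`.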

Answer 1

I think the solution here may be very simple.

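Instead of s3://, use s3a:// as the scheme for jobs that access the bucket. According to the linked documentation, the s3:// scheme is deprecated and requires the bucket in question to be used exclusively for your Hadoop data. To quote from that documentation: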

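This filesystem requires you to dedicate a bucket for the filesystem - you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.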
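If the step arguments are generated by a script, the scheme change suggested here can be applied mechanically. The helper below is only a sketch: `to_s3a` is a hypothetical name, and it rewrites URIs that start with exactly `s3://` while leaving everything else untouched.

```python
def to_s3a(uri):
    """Return the URI with an s3:// scheme replaced by s3a://."""
    prefix = "s3://"
    if uri.startswith(prefix):
        return "s3a://" + uri[len(prefix):]
    # Other schemes (s3a://, s3n://, hdfs://, ...) are returned unchanged.
    return uri


# Example: rewrite the input and output locations from the question.
input_uri = to_s3a("s3://__mybucket__/input/")
output_uri = to_s3a("s3://__mybucket__/output/")
```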

Solution 1
2 2016-12-21 20:06:54