
AWS Data Pipeline keeps running into FileAlreadyExistsException

I basically followed this tutorial to set up a simple Data Pipeline that exports my DynamoDB table to S3.

But whenever I try to run it, it keeps throwing: `Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3://table-ddb-backup/ already exists`. This doesn't make any sense to me: I double-checked, and this bucket doesn't even exist in my AWS account, so how can it "already exist"?

Also, I've changed the bucket name to a different one, but the same error persists. Any pointers, please?

Edit: I just learned from the AWS docs that each S3 bucket name must be globally unique within a partition; I had thought that being unique within my own AWS account was good enough. But this still doesn't explain why the Data Pipeline job keeps failing with this error.

Thanks!

I figured it out by adding this to my CDK code when provisioning the data pipeline:

{
    "key": "preStepCommand",
    "stringValue": "(sudo yum -y update aws-cli) && (aws s3 rm #{output.directoryPath} --recursive)"
},

EMR needs the output directory to be empty (or nonexistent) each time it runs, so the pre-step command deletes whatever a previous run left behind before the export job starts.
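For context, a field like the one above sits inside the EmrActivity object of the pipeline definition. Below is a minimal sketch in the "friendly" pipeline-definition JSON format, assuming the DynamoDB-export setup from the AWS tutorial; the `id` values and the object references (`DDBSourceTable`, `S3BackupLocation`) are hypothetical placeholders, and the `step` string is illustrative rather than a definitive jar path:

```json
{
  "id": "TableBackupActivity",
  "type": "EmrActivity",
  "input": { "ref": "DDBSourceTable" },
  "output": { "ref": "S3BackupLocation" },
  "preStepCommand": "(sudo yum -y update aws-cli) && (aws s3 rm #{output.directoryPath} --recursive)",
  "step": "s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}"
}
```

The `#{output.directoryPath}` expression is resolved by Data Pipeline at runtime to the S3 output location, so the `aws s3 rm --recursive` pre-step empties exactly the directory the export step is about to write to.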
