AWS Glue jobs not writing to S3

I have just been playing around with Glue but have yet to get it to successfully create a new table in an existing S3 bucket. The job executes without error, but there is never any output in S3.

Here is the auto-generated code:

    glueContext.write_dynamic_frame.from_options(
        frame = applymapping1,
        connection_type = "s3",
        connection_options = {"path": "s3://glueoutput/output/"},
        format = "json",
        transformation_ctx = "datasink2")

I have tried all variations of this: with the name of a file (that doesn't exist yet), in the root folder of the bucket, with a trailing slash and without. The role being used has full access to S3. I have also tried creating buckets in different regions. No file is ever created, though, and the console again says the job succeeded.

Your code is correct; just verify whether there is any data at all in the applymapping1 DynamicFrame. You can check with this command: applymapping1.toDF().show()
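For example, a quick sanity check just before the write (a minimal sketch; the count() call is an extra check, not part of the original suggestion):

    # Convert the DynamicFrame to a Spark DataFrame and inspect it
    df = applymapping1.toDF()
    df.show()                          # prints the first rows, if any
    print("row count:", df.count())    # 0 means the sink has nothing to write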

As @Drellgor suggests in his comment on the previous answer, make sure you have disabled "Job Bookmarks" unless you definitely don't want to process old files.

From the documentation:

"AWS Glue tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run. This persisted state information is called a job bookmark. Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data." “AWS Glue 通过保留作业运行中的状态信息来跟踪在 ETL 作业的前一次运行期间已经处理过的数据。这种持久化的状态信息称为作业书签。作业书签帮助 AWS Glue 维护状态信息并防止重新处理旧数据。”

Had the same issue. After a few days of Glue jobs seemingly writing to S3 at random and sometimes not, I found this thread. @Sinan Erdem's suggestion resolved my issue.

From the AWS docs:

Job bookmarks are used to track the source data that has already been processed, preventing the reprocessing of old data. Job bookmarks can be used with JDBC data sources and some Amazon Simple Storage Service (Amazon S3) sources. Job bookmarks are tied to jobs. If you delete a job, then its job bookmark is also deleted.

You can rewind your job bookmarks for your Glue Spark ETL jobs to any previous job run, which allows your job to reprocess the data. If you want to reprocess all the data using the same job, you can reset the job bookmark.

Also found this: How to rewind Job Bookmarks on Glue Spark ETL job?
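Resetting the bookmark can also be done programmatically. A minimal boto3 sketch (the job name is a placeholder):

    import boto3

    glue = boto3.client("glue")

    # Reset the job bookmark; the next run treats all source data as unprocessed
    glue.reset_job_bookmark(JobName="my-glue-job")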

You need to edit your IAM role. The role must be able to write to S3 as well as read from it.

  1. Go to your AWS console
  2. Go to IAM
  3. Policies
  4. Edit the policy
  5. Add the following Put and Delete Object actions for S3, in addition to Get Object (see the policy snippet below).
  6. Then save

Be sure you are running AWS Glue with the IAM role you edited. Good luck.

"Effect": "Allow",
        "Action": [
            "s3:GetObject",
            "s3:PutObject",
            "s3:DeleteObject"
