
AWS Glue jobs not writing to S3

I have just been playing around with Glue but have yet to get it to successfully create a new table in an existing S3 bucket. The job will execute without error but there is never any output in S3.

Here's the auto-generated code:

glueContext.write_dynamic_frame.from_options(
    frame = applymapping1,
    connection_type = "s3",
    connection_options = {"path": "s3://glueoutput/output/"},
    format = "json",
    transformation_ctx = "datasink2")

I have tried all variations of this: with the name of a file (that doesn't exist yet), in the root folder of the bucket, with and without a trailing slash. The role being used has full access to S3. I have also tried creating buckets in different regions. No file is ever created, though, and again the console says the job succeeded.

Your code is correct. Just verify whether there is any data at all in the applymapping1 DynamicFrame. You can check with this command: applymapping1.toDF().show()
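
For example, a minimal check (assuming applymapping1 is the DynamicFrame from your generated script):

    # Convert the DynamicFrame to a Spark DataFrame and inspect it.
    df = applymapping1.toDF()
    print(df.count())  # 0 means there is nothing for the sink to write to S3
    df.show(5)         # preview a few rows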

As @Drellgor suggests in his comment on the previous answer, make sure you have disabled "Job Bookmarks" unless you definitely don't want to process old files.

From the documentation:

"AWS Glue tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run. This persisted state information is called a job bookmark. Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data."

Had the same issue. After a few days of Glue jobs seemingly writing to S3 at random and sometimes not, I found this thread. @Sinan Erdem's suggestion resolved my issue.

From the AWS docs:

Job bookmarks are used to track the source data that has already been processed, preventing the reprocessing of old data. Job bookmarks can be used with JDBC data sources and some Amazon Simple Storage Service (Amazon S3) sources. Job bookmarks are tied to jobs. If you delete a job, then its job bookmark is also deleted.

You can rewind your job bookmarks for your Glue Spark ETL jobs to any previous job run, which allows your job to reprocess the data. If you want to reprocess all the data using the same job, you can reset the job bookmark.

Also found this: How to rewind Job Bookmarks on Glue Spark ETL job?
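
To reset the bookmark programmatically, a minimal sketch using boto3 (the job name is a placeholder):

    import boto3

    glue = boto3.client("glue")

    # Reset the bookmark so the next run reprocesses all source data.
    glue.reset_job_bookmark(JobName="my-glue-job")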

You need to edit your IAM role. You should make sure your IAM role can both write to and read from S3.

  1. Go to your AWS console.
  2. Go to IAM.
  3. Open Policies.
  4. Edit the policy attached to the role your Glue job uses.
  5. Add the following PutObject and DeleteObject actions for S3, in addition to GetObject (see the policy statement below).
  6. Then save.

Be sure you are running AWS Glue with the IAM role you edited. Good luck.

The policy statement should include the following (the Resource ARN below is an example scoped to the bucket from the question; adjust it to your own):

    "Effect": "Allow",
    "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
    ],
    "Resource": "arn:aws:s3:::glueoutput/*"
