I'm fairly new to AWS Glue and would like to understand how to do the following:
So far I have this:
tableA_DF = pandas.read_cv("https://example.com/file.csv")
tableB = glueContext.create_dynamic_frame.from_catalog(database=Z, table_name = Y)
tableB_DF = TableB.toDF()
However, I'm not sure on how to join the two. Would like to simply add one column from tableB to tableA, and then store the result in the Data catalogue.
You can read directly into Dynamicframe from S3 using create_dynamic_frame_from_options
.
Lookupdata = GlueContext.create_dynamic_frame_from_options(connection_type="s3",connection_options = {"paths":[InputLookupDir]},format="csv",format_options={"withHeader": True,"separator": ",","quoteChar": '"',"escaper": '"'})
Then read the data from catalog using as shown below:
datasource0 = GlueContext.create_dynamic_frame.from_catalog(database = Glue_catalog_database, table_name = Glue_table_name, transformation_ctx = "datasource0")
Then you can join two frames using join and then write this back to catalogue.
[joined_dyf][1] = Join.apply(Lookupdata,Join.apply(datasource0, Lookupdata,'someidfrom_datasource0','someidfrom_Lookupdata')
Now write this back to Catalogue as table refer tothis for more information:
sink = glueContext.getSink(connection_type="s3", path="s3://path/to/data",
enableUpdateCatalog=True, updateBehavior="UPDATE_IN_DATABASE",
partitionKeys=["partition_key0", "partition_key1"])
sink.setFormat("<format>")
sink.setCatalogInfo(catalogDatabase=<dst_db_name>, catalogTableName=<dst_tbl_name>)
sink.writeFrame(last_transform)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.