简体   繁体   中英

Read CSV into AWS Glue and join with data from Data Catalogue

I'm fairly new to AWS Glue and would like to understand how to do the following:

  1. Pull a CSV file from a URL in AWS Glue
  2. Join the dataset with a column from a table I have in the Data Catalogue.
  3. Write this back to the Data Catalogue as a new table.

So far I have this:

  tableA_DF = pandas.read_cv("https://example.com/file.csv")
  tableB = glueContext.create_dynamic_frame.from_catalog(database=Z, table_name = Y)
  tableB_DF = TableB.toDF()

However, I'm not sure on how to join the two. Would like to simply add one column from tableB to tableA, and then store the result in the Data catalogue.

You can read directly into Dynamicframe from S3 using create_dynamic_frame_from_options .

Lookupdata  = GlueContext.create_dynamic_frame_from_options(connection_type="s3",connection_options = {"paths":[InputLookupDir]},format="csv",format_options={"withHeader": True,"separator": ",","quoteChar": '"',"escaper": '"'})

Then read the data from catalog using as shown below:

datasource0 = GlueContext.create_dynamic_frame.from_catalog(database = Glue_catalog_database, table_name = Glue_table_name, transformation_ctx = "datasource0")

Then you can join two frames using join and then write this back to catalogue.

[joined_dyf][1] = Join.apply(Lookupdata,Join.apply(datasource0, Lookupdata,'someidfrom_datasource0','someidfrom_Lookupdata')

Now write this back to Catalogue as table refer tothis for more information:

sink = glueContext.getSink(connection_type="s3", path="s3://path/to/data",
                           enableUpdateCatalog=True, updateBehavior="UPDATE_IN_DATABASE",
                           partitionKeys=["partition_key0", "partition_key1"])
sink.setFormat("<format>")
sink.setCatalogInfo(catalogDatabase=<dst_db_name>, catalogTableName=<dst_tbl_name>)
sink.writeFrame(last_transform)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM