Read CSV into AWS Glue and join with data from Data Catalogue

Question

I'm fairly new to AWS Glue and would like to understand how to do the following:

Pull a CSV file from a URL in AWS Glue
Join the dataset with a column from a table I have in the Data Catalogue.
Write this back to the Data Catalogue as a new table.

So far I have this:

  tableA_DF = pandas.read_cv("https://example.com/file.csv")
  tableB = glueContext.create_dynamic_frame.from_catalog(database=Z, table_name = Y)
  tableB_DF = TableB.toDF()

However, I'm not sure on how to join the two. Would like to simply add one column from tableB to tableA, and then store the result in the Data catalogue.

Answer 1

You can read directly into Dynamicframe from S3 using create_dynamic_frame_from_options .

Lookupdata  = GlueContext.create_dynamic_frame_from_options(connection_type="s3",connection_options = {"paths":[InputLookupDir]},format="csv",format_options={"withHeader": True,"separator": ",","quoteChar": '"',"escaper": '"'})

Then read the data from catalog using as shown below:

datasource0 = GlueContext.create_dynamic_frame.from_catalog(database = Glue_catalog_database, table_name = Glue_table_name, transformation_ctx = "datasource0")

Then you can join two frames using join and then write this back to catalogue.

[joined_dyf][1] = Join.apply(Lookupdata,Join.apply(datasource0, Lookupdata,'someidfrom_datasource0','someidfrom_Lookupdata')

Now write this back to Catalogue as table refer tothis for more information:

sink = glueContext.getSink(connection_type="s3", path="s3://path/to/data",
                           enableUpdateCatalog=True, updateBehavior="UPDATE_IN_DATABASE",
                           partitionKeys=["partition_key0", "partition_key1"])
sink.setFormat("<format>")
sink.setCatalogInfo(catalogDatabase=<dst_db_name>, catalogTableName=<dst_tbl_name>)
sink.writeFrame(last_transform)

Read CSV into AWS Glue and join with data from Data Catalogue

Question

1 answers

solution1
0 ACCPTED 2020-10-27 03:53:18

Read CSV into AWS Glue and join with data from Data Catalogue

Question

1 answers

solution1 0 ACCPTED 2020-10-27 03:53:18

solution1
0 ACCPTED 2020-10-27 03:53:18