
AWS Glue Job - CSV to Parquet. How to ignore header?

I need to convert a bunch (23) of CSV files (source S3) into Parquet format. Every input CSV file contains a header row. When I ran the code that Glue generated for this, the output also contained 22 header rows as data rows, which means Glue only skipped the first file's header. I need help ignoring the headers in all files during this transformation.

Since I'm using from_catalog function for my input, I don't have any format_options to ignore the header rows.

Also, can I set an option in the Glue table that the header is present in the files? Will that automatically ignore the header when my job runs?

Part of my current approach is below. I'm new to Glue. This code was actually auto-generated by Glue.

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "my_datalake", table_name = "my-csv-files", transformation_ctx = "datasource0")

datasink1 = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3", connection_options = {"path": "s3://my-bucket-name/full/s3/path-parquet"}, format = "parquet", transformation_ctx = "datasink1")

I faced this exact issue while working on an ETL job that used AWS Glue.

The documentation for from_catalog says:

  • additional_options – A collection of optional name-value pairs. The possible options include those listed in Connection Types and Options for ETL in AWS Glue except for endpointUrl, streamName, bootstrap.servers, security.protocol, topicName, classification, and delimiter.

I tried the snippet below, and several permutations of it, with from_catalog. But nothing worked for me.

additional_options = {"format": "csv", "format_options": '{"withHeader": "True"}'},

One way to go about fixing this is by using from_options instead of from_catalog and pointing directly to the S3 bucket or folder. This is what it should look like:

datasource0 = glueContext.create_dynamic_frame.from_options(
  connection_type="s3",
  connection_options={
      'paths': ['s3://bucket_name/folder_name'],
      "recurse": True,
      'groupFiles': 'inPartition'
  }, 
  format="csv", 
  format_options={
      "withHeader": True
  }, 
  transformation_ctx = "datasource0"
)

But if you can't do this for any reason and want to stick with from_catalog, using a Filter transform worked for me.

Assuming that one of your column names is name, the snippet can look like this:

from awsglue.transforms import Filter

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "my_datalake", table_name = "my-csv-files", transformation_ctx = "datasource0")
# Drop stray header rows: a data row whose "name" value is the literal
# string "name" is a header that leaked in from one of the CSV files.
filtered_df = Filter.apply(frame = datasource0, f = lambda x: x["name"] != "name")

I'm not entirely sure how Spark's DataFrames or Glue's DynamicFrames handle CSV headers, or why data read from the catalog had the headers both in the schema and in the rows, but this solved my issue by removing the header values from the rows.
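The idea behind that Filter predicate can be sketched in plain Python, without Glue or Spark. The rows and column names below are hypothetical; each dict stands in for one DynamicFrame record, and a stray header row is one that repeats the column names as its values. Checking every column (not just one) is a bit more robust in case some column legitimately contains its own name as a value:

```python
# Hypothetical records, shaped like DynamicFrame rows: column name -> value.
# The stray header rows came from CSV files whose first line wasn't skipped.
header_cols = ["name", "age"]

rows = [
    {"name": "name", "age": "age"},   # leaked header from file 2
    {"name": "alice", "age": "30"},
    {"name": "name", "age": "age"},   # leaked header from file 3
    {"name": "bob", "age": "25"},
]

def is_header(row):
    # A row is a header only if EVERY column holds its own column name.
    return all(row[col] == col for col in header_cols)

data_rows = [r for r in rows if not is_header(r)]
print(data_rows)
```

In Glue terms this corresponds to a predicate like `lambda x: not all(x[c] == c for c in header_cols)` passed to Filter.apply, instead of checking a single column.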
