Is there a way to write only “good” records to a SQL Server table and return the “bad” records using an AWS Glue job?

I am trying to write a Glue (PySpark) job that performs some ETL and eventually writes the data to a table in SQL Server (defined in the AWS Glue Catalog). While writing the records to the SQL Server table, there may be constraints (for example: primary keys, foreign keys, column types) that prevent certain records (i.e., "bad" records) from being written to the table. When this happens, the Glue job throws an error and the job fails. Is there a way to prevent the entire job from failing? Instead, would it be possible to write only the "good" records and return the "bad" records that violated the SQL Server constraints back to the Glue job (so that they can be uploaded to S3)?

I am using the write_dynamic_frame_from_catalog function to write the data to the SQL Server table. Here is some sample code for context:

# perform ETL
from awsglue.dynamicframe import DynamicFrame

output_df = spark.sql("SELECT ...")

# convert the DataFrame to a DynamicFrame and write it to SQL Server
output_dynamic_frame = DynamicFrame.fromDF(output_df, glueContext, "output_dynamic_frame")
glueContext.write_dynamic_frame_from_catalog(
    frame=output_dynamic_frame,
    database="<DATABASE_NAME>",
    table_name="<TABLE_NAME>",
)

After writing the data to SQL Server, I want the records that violated SQL Server table constraints to be returned so that they can be uploaded to S3.

I think you can use AWS Glue to extract the data from your DB into S3, and then use PySpark to capture the "bad records" when reading the S3 files:

# PERMISSIVE mode keeps malformed rows and stores their raw text in the
# column named by columnNameOfCorruptRecord (for CSV, that column must
# also be declared in an explicit schema for it to be populated)
corruptDF = (spark.read
             .option("mode", "PERMISSIVE")
             .option("columnNameOfCorruptRecord", "_corrupt_record")
             .csv("s3://bucket-name/path"))

Then you can filter on the _corrupt_record column and save the "good" ones to your DB and the "bad" ones to an S3 path, as sketched below.
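A minimal sketch of that split, reusing corruptDF from the snippet above and the same DynamicFrame import and glueContext as in the question (the S3 path and the catalog database/table names are placeholders):

from pyspark.sql.functions import col

# Cache first: Spark rejects queries whose only referenced column is the
# internal corrupt-record column, and caching is the documented workaround.
corruptDF.cache()

# Split on the corrupt-record column populated in PERMISSIVE mode.
bad_df = corruptDF.filter(col("_corrupt_record").isNotNull())
good_df = corruptDF.filter(col("_corrupt_record").isNull()).drop("_corrupt_record")

# "Bad" rows go to S3 for later inspection (placeholder path).
bad_df.write.mode("overwrite").json("s3://bucket-name/bad-records/")

# "Good" rows go back through the Glue catalog writer, as in the question.
good_dynamic_frame = DynamicFrame.fromDF(good_df, glueContext, "good_dynamic_frame")
glueContext.write_dynamic_frame_from_catalog(
    frame=good_dynamic_frame,
    database="<DATABASE_NAME>",
    table_name="<TABLE_NAME>",
)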

Also, Databricks has functionality for handling bad records and files: you can give a badRecordsPath option when reading a file so that the "bad records" are sent to that path. Be advised that this only works when reading CSV, JSON, and the other file-based built-in sources (e.g., Parquet).
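A minimal sketch of that option (note that badRecordsPath is Databricks-specific and has no effect on plain open-source Spark; the paths are placeholders):

# On a Databricks runtime, malformed rows are written as JSON files under
# badRecordsPath instead of failing the job or landing in the DataFrame.
df = (spark.read
      .option("badRecordsPath", "s3://bucket-name/bad-records/")
      .csv("s3://bucket-name/path"))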
