
Incremental data load from Redshift to S3 using PySpark and Glue Jobs

I have created a pipeline where data ingestion takes place between Redshift and S3. I was able to do the full load using the method below:

from pyspark.sql import SparkSession

def readFromRedShift(spark: SparkSession, schema, tablename):
    # Build the fully qualified table name, e.g. "my_schema.my_table"
    table = f"{schema}.{tablename}"
    # con is a separate connection-helper class that returns the JDBC details
    (url, Properties, host, port, db) = con.getConnection("REDSHIFT")
    df = spark.read.jdbc(url=url, table=table, properties=Properties)
    return df

Here, getConnection is a method in a separate class that handles all the Redshift-related connection details. Later on, I used this method to create a DataFrame and wrote the results to S3, which worked like a charm.
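For reference, a minimal sketch of what that full-load usage might look like; the schema, table name, output path, and Parquet format are hypothetical, since the question does not specify them:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-full-load").getOrCreate()

# Full load: read the whole table from Redshift and write it to S3 as Parquet
df = readFromRedShift(spark, "my_schema", "my_table")   # hypothetical schema/table
df.write.mode("overwrite").parquet("s3://my-bucket/full-load/")   # hypothetical path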

Now I want to load only the incremental data. Will enabling the Glue Job Bookmarks option help me, or is there another way to do it? I followed the official documentation, but it was of no help for my problem statement. If I run the job for the first time it will load the complete data; if I rerun it, will it load only the newly arrived records?

You are right, this can be achieved with job bookmarks, but at the same time it can be a bit tricky. Please refer to this post: https://aws.amazon.com/blogs/big-data/load-data-incrementally-and-optimized-parquet-writer-with-aws-glue/
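Below is a minimal sketch of a Glue job script with bookmarks enabled (the job must be run with the "Job bookmark" option set to Enable, i.e. --job-bookmark-option job-bookmark-enable). It assumes the source table is registered in the Glue Data Catalog under a hypothetical database my_db and table my_table, and that id is a monotonically increasing key column; the transformation_ctx values plus job.commit() are what let Glue persist the bookmark state between runs.

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Parse job arguments; JOB_NAME is required so bookmark state is tracked per job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source table from the Glue Data Catalog.
# For JDBC sources (such as Redshift), bookmarks need an ordered key column,
# passed via jobBookmarkKeys; "id" here is a hypothetical example.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",              # hypothetical catalog database
    table_name="my_table",         # hypothetical catalog table
    transformation_ctx="dyf_src",  # bookmark state is keyed on this name
    additional_options={
        "jobBookmarkKeys": ["id"],
        "jobBookmarkKeysSortOrder": "asc",
    },
)

# Write only the newly arrived rows to S3 as Parquet (path is hypothetical)
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/incremental/"},
    format="parquet",
    transformation_ctx="dyf_sink",
)

# Committing the job persists the bookmark, so the next run reads only rows
# with id values beyond what has already been processed
job.commit()

On the first run this loads the complete table; on subsequent runs the bookmark filters the source down to records not yet seen, which is the incremental behavior asked about.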
