
Incremental data load from Redshift to S3 using PySpark and Glue Jobs

I have created a pipeline where data ingestion takes place between Redshift and S3. I was able to do the full load using the method below:

from pyspark.sql import SparkSession

def readFromRedShift(spark: SparkSession, schema, tablename):
    # Build the fully qualified table name, e.g. "my_schema.my_table"
    table = f"{schema}.{tablename}"
    # con is a separate connection-helper class that returns the JDBC details
    (url, Properties, host, port, db) = con.getConnection("REDSHIFT")
    df = spark.read.jdbc(url=url, table=table, properties=Properties)
    return df

Here, getConnection is a method in a separate class that handles all the Redshift-related connection details. Later on, I used this method to create a DataFrame and wrote the results to S3, which worked like a charm.
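For reference, a minimal sketch of what that full-load usage might look like; the schema, table name, output path, and Parquet format are hypothetical, since the question does not specify them:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-full-load").getOrCreate()

# Full load: read the whole table from Redshift and write it to S3 as Parquet
df = readFromRedShift(spark, "my_schema", "my_table")   # hypothetical schema/table
df.write.mode("overwrite").parquet("s3://my-bucket/full-load/")   # hypothetical path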

Now I want to load only the incremental data. Will enabling the Glue Job Bookmarks option help me, or is there another way to do it? I followed the official documentation, but it was of no help for my problem statement. If I run the job for the first time it will load the complete data; if I rerun it, will it load only the newly arrived records?

You are right, this can be achieved with job bookmarks, but at the same time it can be a bit tricky. Please refer to this post: https://aws.amazon.com/blogs/big-data/load-data-incrementally-and-optimized-parquet-writer-with-aws-glue/
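Below is a minimal sketch of a Glue job script with bookmarks enabled (the job must be run with the "Job bookmark" option set to Enable, i.e. --job-bookmark-option job-bookmark-enable). It assumes the source table is registered in the Glue Data Catalog under a hypothetical database my_db and table my_table, and that id is a monotonically increasing key column; the transformation_ctx values plus job.commit() are what let Glue persist the bookmark state between runs.

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Parse job arguments; JOB_NAME is required so bookmark state is tracked per job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source table from the Glue Data Catalog.
# For JDBC sources (such as Redshift), bookmarks need an ordered key column,
# passed via jobBookmarkKeys; "id" here is a hypothetical example.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",              # hypothetical catalog database
    table_name="my_table",         # hypothetical catalog table
    transformation_ctx="dyf_src",  # bookmark state is keyed on this name
    additional_options={
        "jobBookmarkKeys": ["id"],
        "jobBookmarkKeysSortOrder": "asc",
    },
)

# Write only the newly arrived rows to S3 as Parquet (path is hypothetical)
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/incremental/"},
    format="parquet",
    transformation_ctx="dyf_sink",
)

# Committing the job persists the bookmark, so the next run reads only rows
# with id values beyond what has already been processed
job.commit()

On the first run this loads the complete table; on subsequent runs the bookmark filters the source down to records not yet seen, which is the incremental behavior asked about.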
