
Upload data to Redshift with PySpark

I have a script written in PySpark. What I'm trying to do is read a *.csv file from an S3 bucket in AWS using PySpark.

I create a DataFrame with all the data, select only the columns I need, and cast them to the types my Redshift table expects:

    from pyspark.sql.types import StringType

    columns_mapping = [('id', StringType), ('session', StringType), ('ip', StringType)]

    df = spark.read.\
        format("csv").\
        option("header", True).\
        load(f"...")
    
    rows_to_map = [field[0] for field in columns_mapping]
    # We need to select only specific columns
    mapped_df = df.select(*rows_to_map)
    # Now need to cast types
    for mapping in columns_mapping:
        mapped_df = mapped_df.withColumn(mapping[0], mapped_df[mapping[0]].cast(mapping[1]()))
    
    mapped_df.printSchema()
    
    mapped_df.write.format("com.databricks.spark.redshift").\
        option("url", "...").\
        option("dbtable", "...").\
        option("tempdir", "...").\
        option("user", "...").\
        option("password", "...").\
        option("aws_iam_role", "...").\
        mode("append").\
        save()

And I receive an error while inserting the data into Redshift: Check 'stl_load_errors' system table for details.
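To see the failing column and raw value, I read the latest entries of stl_load_errors back into Spark over plain JDBC (a rough sketch; it assumes Spark 2.4+ and a Redshift JDBC driver on the classpath, and reuses the same placeholder connection values as above):

    # Sketch: inspect the most recent COPY errors over plain JDBC.
    # The url/user/password values are the same placeholders as above.
    errors_df = spark.read.\
        format("jdbc").\
        option("url", "...").\
        option("user", "...").\
        option("password", "...").\
        option("query", "SELECT colname, type, raw_field_value, err_reason "
                        "FROM stl_load_errors ORDER BY starttime DESC LIMIT 20").\
        load()
    errors_df.show(truncate=False)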

There I see that it reads the columns from the CSV in (almost) random order.

Schema of my DataFrame:

|-- id: string (nullable = true) 
|-- session: string (nullable = true) 
|-- ip: string (nullable = true)
...

As you can see from the first rows, it goes id -> session -> ip... But my Redshift table has the same fields in a different order. Its first 3 columns:

|-- id: string (nullable = true) 
|-- created_at: long (nullable = true) 
|-- session: string (nullable = true)

As a result, on the second column it complains that I'm trying to write a STRING into a LONG column: instead of created_at, it reads session from the file.

Question: is the order of columns in my DataFrame (the temp file) critical? Is there any solution for that? Reordering each file by hand would take too much time.

Thanks for any help.

Provide a list of the column names in your Redshift table, in table order, and rearrange the columns of the Spark DataFrame before writing:

    # Redshift table columns, in the correct order
    colnames = ['id', 'created_at', 'session', ...]

    mapped_df = mapped_df.select(colnames)
    mapped_df.write(...)
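If you don't want to hard-code the list, you can also derive it from the table itself: read the table back through the same connector and reuse the resulting DataFrame's column order (a rough sketch with the same placeholder options as in the question; accessing .columns only needs the schema, so no data should actually be unloaded):

    # Sketch: take the column order from the Redshift table definition
    # (placeholder options as in the question).
    redshift_df = spark.read.format("com.databricks.spark.redshift").\
        option("url", "...").\
        option("dbtable", "...").\
        option("tempdir", "...").\
        option("user", "...").\
        option("password", "...").\
        option("aws_iam_role", "...").\
        load()

    colnames = redshift_df.columns      # column order as defined in Redshift
    mapped_df = mapped_df.select(colnames)
    # ...then write mapped_df with the same options as in the question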
