I have a script written in PySpark. What I'm trying to do is read *.csv files from an S3 bucket in AWS using PySpark.
I create a DataFrame with all the data, select the columns I need, and cast them to the types my Redshift table expects:
from pyspark.sql.types import StringType

columns_mapping = [('id', StringType), ('session', StringType), ('ip', StringType)]
df = spark.read.\
    format("csv").\
    option("header", True).\
    load(f"...")
rows_to_map = [field[0] for field in columns_mapping]
# We need to select only specific columns
mapped_df = df.select(*rows_to_map)
# Now need to cast types
for col_name, col_type in columns_mapping:
    mapped_df = mapped_df.withColumn(col_name, mapped_df[col_name].cast(col_type()))
mapped_df.printSchema()
mapped_df.write.format("com.databricks.spark.redshift").\
    option("url", "...").\
    option("dbtable", "...").\
    option("tempdir", "...").\
    option("user", "...").\
    option("password", "...").\
    option("aws_iam_role", "...").\
    mode("append").\
    save()
And I receive an error while inserting data into Redshift: Check 'stl_load_errors' system table for details.
There I see that it reads columns from the CSV in an (almost) random order.
SCHEMA of my dataframe:
|-- id: string (nullable = true)
|-- session: string (nullable = true)
|-- ip: string (nullable = true)
...
As you can see from the first rows, the order is id -> session -> ip... But my Redshift table has the same fields in a different order. First 3 rows:
|-- id: string (nullable = true)
|-- created_at: long (nullable = true)
|-- session: string (nullable = true)
As a result, on the second column it complains that I'm trying to write a STRING into a LONG column: instead of created_at it reads session from the file.
Question: is the order of columns in my DataFrame (tmp_file) critical? Is there any solution for that? Processing each file separately would take too much time.
Thanks for any help.
Provide a list of the column names in your Redshift table, in table order, and rearrange the columns of the Spark DataFrame before writing:
# redshift table columns, in correct order
colnames = ['id', 'created_at', 'session', ...]
mapped_df = mapped_df.select(colnames)  # or select(*colnames)
mapped_df.write...  # same options chain as above, ending in .save()
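The reason this works: as the stl_load_errors output suggests, the staged data is loaded into the table by position, not by column name, so the DataFrame's column order has to match the table definition exactly. The reordering step itself can be illustrated without Spark; here is a minimal sketch in plain Python (the column names are just the ones from this question, and `reorder_columns` is a hypothetical helper, not part of any library):

```python
# The load maps file columns to table columns by position, not by name,
# so the DataFrame column order must match the Redshift table order.
def reorder_columns(df_columns, table_columns):
    """Return the table's columns in table order, checking they all
    exist in the DataFrame; pass the result to mapped_df.select()."""
    missing = [c for c in table_columns if c not in df_columns]
    if missing:
        raise ValueError(f"DataFrame is missing columns: {missing}")
    return list(table_columns)

# DataFrame order from the question vs. the Redshift table order
df_cols = ["id", "session", "ip", "created_at"]
table_cols = ["id", "created_at", "session", "ip"]

print(reorder_columns(df_cols, table_cols))
# ['id', 'created_at', 'session', 'ip']
```

With a list like this, `mapped_df.select(colnames)` only rearranges the logical column order; it does not copy data, so the cost is negligible compared to reprocessing each file.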