
pyspark df.write (parquet) to S3, but data is missing in half the columns

Using EMR with 4 workers and 1 master

  • Release label: emr-5.24.0
  • Hadoop distribution: Amazon 2.8.5
  • Applications: Spark 2.4.2, Hive 2.3.4

I am able to process my data and create the correct dataframe in pyspark. But when I write the df out to S3 as parquet, the files are indeed placed in the correct S3 location, yet 3 of the 7 columns are suddenly missing data.

Can anyone explain what I need to do to fix this? Here are the relevant code and result screenshots. I've renamed some columns in the screenshots to maintain privacy.

My code:

# For multi tables
df_multi.show(5)
df_multi.printSchema()
print("\n At line 578, after show(), writing to EDL\n")
df_multi.write.mode("append").parquet(multi_s3_bucket_dir)
print("\n  SCRIPT COMPLETED  \n")

A screenshot of the output when the script runs. I run it with nohup and redirect stdout & stderr to a file to review later: [screenshot of the run]

And here is the output, read from S3 using Athena: [Athena query results]
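One check that narrows this kind of problem down (a sketch, assuming a SparkSession named spark and reusing the same multi_s3_bucket_dir variable from the script above) is to read the written files back and look at the column names Spark actually embedded in the Parquet:

# Read back what was just written and inspect the schema embedded in the Parquet files.
written = spark.read.parquet(multi_s3_bucket_dir)
written.printSchema()
print(written.columns)  # these names must match the Athena DDL exactly, since Athena resolves Parquet columns by name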

Mea culpa. Problem solved. My column names in the df did not exactly match the column names in the Athena DDL. Because Athena is schema-on-read and resolves Parquet columns by name, the files carried the df's column names, and Athena could only populate the columns whose names DID match its DDL, leaving the rest empty.
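One way to guard against this in the script itself is a positional rename to the DDL names just before the write. A minimal sketch, with made-up column names for illustration:

# Hypothetical DDL column names -- replace with the names from your Athena table, in the df's column order.
ddl_columns = ["id", "event_ts", "col_a", "col_b", "col_c", "col_d", "col_e"]
df_multi = df_multi.toDF(*ddl_columns)  # positional rename; count and order must match df_multi.columns
df_multi.write.mode("append").parquet(multi_s3_bucket_dir)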

Lesson learned.

