
pyspark df.write (parquet) to S3, but data is missing in half the columns

Using EMR with 4 workers and 1 master

  • Release label: emr-5.24.0
  • Hadoop distribution: Amazon 2.8.5
  • Applications: Spark 2.4.2, Hive 2.3.4

I am able to process my data and create the correct dataframe in pyspark. But when I write the df out to S3 as parquet, the files are indeed placed in the correct S3 location, yet 3 of the 7 columns are suddenly missing data.

Can anyone explain what I need to do to fix this? Here are the relevant code and result screenshots. I've renamed some columns in the screenshots to maintain privacy.

My code:

# For multi tables
df_multi.show(5)
df_multi.printSchema()
print("\n At line 578, after show(), writing to EDL\n")
df_multi.write.mode("append").parquet(multi_s3_bucket_dir)
print("\n  SCRIPT COMPLETED  \n")

A screenshot of the output when the script runs. I run it with nohup and redirect stdout & stderr to a file to review later: [screenshot of the run]

And here is the output, read from S3 using Athena: [Athena query results]
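One check that narrows this kind of problem down (a sketch, assuming a SparkSession named spark and reusing the same multi_s3_bucket_dir variable from the script above) is to read the written files back and look at the column names Spark actually embedded in the Parquet:

# Read back what was just written and inspect the schema embedded in the Parquet files.
written = spark.read.parquet(multi_s3_bucket_dir)
written.printSchema()
print(written.columns)  # these names must match the Athena DDL exactly, since Athena resolves Parquet columns by name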

Mea culpa. Problem solved. My column names in the df did not exactly match the column names in the Athena DDL. Because Athena is schema-on-read and resolves Parquet columns by name, the files carried the df's column names, and Athena could only populate the columns whose names DID match its DDL, leaving the rest empty.
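One way to guard against this in the script itself is a positional rename to the DDL names just before the write. A minimal sketch, with made-up column names for illustration:

# Hypothetical DDL column names -- replace with the names from your Athena table, in the df's column order.
ddl_columns = ["id", "event_ts", "col_a", "col_b", "col_c", "col_d", "col_e"]
df_multi = df_multi.toDF(*ddl_columns)  # positional rename; count and order must match df_multi.columns
df_multi.write.mode("append").parquet(multi_s3_bucket_dir)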

Lesson learned.

