
Is there a way to process a JSON file from an S3 bucket using PySpark without downloading it?

I have some large JSON files in a specific S3 bucket folder. Each file contains one JSON object per line. I tried to read one with spark.read.json("s3a://bucket/prefix/file.json") but got a "Premature end of Content-Length delimited message body" error.

  1. Is there a way to deal with empty rows in the JSON while reading it?
  2. How can we read the JSON line by line and process it? Ultimately I need to do some event analysis on the JSON data.
  3. Can we process/analyze the JSON from S3 itself, without downloading it?

I am using Spark 2.4.7 with Hadoop 2.7.1, Java 1.8, and Python 3.7.

Try this:

spark.read.option(
    "multiLine", True  # only if a single JSON object spans multiple lines;
                       # leave it False (the default) for one-object-per-line files
).option(
    "mode", "PERMISSIVE"  # keep malformed rows instead of failing the job
).json("s3a://bucket/prefix/file.json")

