
Is there a way to process a JSON file from an S3 bucket using PySpark without downloading it?

I have some large JSON files in a specific S3 bucket folder. Each file contains one JSON object per line. I tried to read one with spark.read.json("s3a://bucket/prefix/file.json") but got a "Premature end of Content-Length delimited message body" error.

  1. Is there a way to deal with empty rows in the JSON while reading it?
  2. How can we read the JSON line by line and process it? Ultimately I need to do some event analysis on the JSON data.
  3. Can we process/analyze the JSON from S3 itself, without downloading it?

I am using Spark 2.4.7 with Hadoop 2.7.1, Java 1.8, and Python 3.7.

Try this:

spark.read.option(
    "multiLine", True  # only if a single JSON object spans multiple lines;
                       # leave it False (the default) for one-object-per-line files
).option(
    "mode", "PERMISSIVE"  # keep malformed rows instead of failing the job
).json("s3a://bucket/prefix/file.json")

