
PySpark - JSON to RDD/coalesce

Based on the suggestion in the answer to this question I asked earlier, I was able to transform my RDD into JSON in the format I want. In order to save this to HDFS, I'd like to convert it back to an RDD and use coalesce to save it into 10 partition files.

What I'm doing so far:

  • convert to an RDD using my_rdd = sc.parallelize([my_json])
  • coalesce and save using my_rdd.coalesce(10).saveAsTextFile

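Put together, the two steps look roughly like this (a sketch; my_json and the HDFS path are placeholders for values from my actual job):

# Assumption: my_json is the single serialized JSON string produced earlier.
my_rdd = sc.parallelize([my_json])
my_rdd.coalesce(10).saveAsTextFile("/user/me/json_output")  # hypothetical path
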
In my tests, this executes successfully, but only one of the 10 partition files has data. On further inspection, it looks like the entire JSON is loaded into the RDD as a single record, rather than one record per JSON element, so the coalesce function cannot distribute the data across partitions.
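
Counting the RDD's elements before saving confirms this; if the whole document went in as one string, the count is 1:

print(my_rdd.count())   # prints 1: the entire JSON is a single record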

I tried issuing hadoop fs -text <saved_file_partition> | head -n 1 and the entire JSON was spat out, as opposed to only the first record.

How can I convert my JSON object to an RDD properly?

Since you define the RDD as

sc.parallelize([my_json])

it will have only one record, and a single record is never split across partitions. Therefore it doesn't matter how many partitions you use: there can be only one non-empty partition in your dataset.
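
A sketch of one way around this, assuming my_json was serialized from a list of records: parallelize one JSON string per record (JSON Lines style) instead of a single string for the whole document, so there are many records for coalesce to distribute:

import json

# Assumption (illustrative names): records is the list of Python dicts
# that my_json was serialized from in the earlier question.
records = [{"id": i, "value": i * i} for i in range(1000)]

# One JSON string per record -> one RDD element per record.
# Start with at least 10 slices, since coalesce can only reduce the count.
lines = sc.parallelize([json.dumps(r) for r in records], 20)
lines.coalesce(10).saveAsTextFile("/user/me/json_output")  # hypothetical path

Each of the 10 output part files then holds roughly a tenth of the records.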
