
Merge Spark RDDs from bad JSON

I have a lot of JSON files, but they aren't formatted correctly for Spark. I don't want to write code that specifically converts them to the correct format by normalizing each dict on each line.

Instead, I am hoping to use Spark to parse their content. I have the following:

import json
import os

json_dir = '/data/original/TEMP'

# wholeTextFiles yields one (path, content) pair per file
df = sc.wholeTextFiles(os.path.join(json_dir, '*.json'))

# parse each file's full text; each element of j_docs is the
# list of dicts parsed from one file
j_docs = df.map(lambda x: json.loads(x[1])).cache()

This works fine, and j_docs is essentially an RDD of lists. For example, the first item in j_docs is the list of dicts parsed from the first file.
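As a minimal sketch of that structure, assuming each file holds a JSON array of objects (the file contents below are hypothetical):

# Hypothetical input files:
#   a.json -> [{"id": 1}, {"id": 2}]
#   b.json -> [{"id": 3}]
j_docs.first()  # [{'id': 1}, {'id': 2}] -- the whole list from the first file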

I would like to combine all of these individual lists into one large RDD, ideally without having to run a collect on the data.

Thanks

Use flatMap instead of map, as shown below:

j_docs = df.flatMap(lambda x: json.loads(x[1])).cache()
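flatMap applies the function to each (path, content) pair and then flattens each returned list, so the resulting RDD contains one element per dict rather than one element per file. A quick check, reusing the hypothetical files from above:

j_docs.first()  # {'id': 1} -- a single dict, not a list
j_docs.count()  # 3 -- total number of dicts across all files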
