
Merge Spark RDDs from bad JSON

I have a lot of JSON files, but they aren't formatted correctly for Spark. I don't want to write code that specifically converts them to the correct format by normalizing each dict on each line.

Instead, I am hoping to use Spark to parse their content. I have the following:

import json
import os

json_dir = '/data/original/TEMP'

# wholeTextFiles yields one (path, content) pair per file
df = sc.wholeTextFiles(os.path.join(json_dir, '*.json'))

# parse each file's full text; each element of j_docs is the
# list of dicts parsed from one file
j_docs = df.map(lambda x: json.loads(x[1])).cache()

This works fine, and j_docs is essentially an RDD of lists. For example, the first item in j_docs is the list of dicts parsed from the first file.
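As a minimal sketch of that structure, assuming each file holds a JSON array of objects (the file contents below are hypothetical):

# Hypothetical input files:
#   a.json -> [{"id": 1}, {"id": 2}]
#   b.json -> [{"id": 3}]
j_docs.first()  # [{'id': 1}, {'id': 2}] -- the whole list from the first file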

I would like to combine all of these individual lists into one large RDD, ideally without having to run a collect on the data.

Thanks

Use flatMap instead of map, as shown below:

j_docs = df.flatMap(lambda x: json.loads(x[1])).cache()
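flatMap applies the function to each (path, content) pair and then flattens each returned list, so the resulting RDD contains one element per dict rather than one element per file. A quick check, reusing the hypothetical files from above:

j_docs.first()  # {'id': 1} -- a single dict, not a list
j_docs.count()  # 3 -- total number of dicts across all files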
