
Reading a JSON file into an RDD (not a DataFrame) using PySpark

I have the following file, test.json:

{
    "id": 1,
    "name": "A green door",
    "price": 12.50,
    "tags": ["home", "green"]
}

I want to load this file into an RDD. This is what I tried:

import json

rddj = sc.textFile('test.json')
rdd_res = rddj.map(lambda x: json.loads(x))

I get the following error:

Expecting object: line 1 column 1 (char 0)

I don't completely understand what json.loads does.

How can I resolve this problem?

textFile reads the data line by line, and the individual lines of your input are not syntactically valid JSON documents, so json.loads fails on each one.
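To see why, here is a plain-Python reproduction (no Spark needed): parsing any single line of the pretty-printed file fails, while parsing the whole file at once succeeds.

```python
import json

# The file content as sc.textFile would deliver it: one element per line
lines = [
    '{',
    '    "id": 1,',
    '    "name": "A green door",',
    '    "price": 12.50,',
    '    "tags": ["home", "green"]',
    '}',
]

try:
    json.loads(lines[0])      # just "{" -- not a complete JSON document
    single_line_ok = True
except ValueError:            # json.JSONDecodeError subclasses ValueError
    single_line_ok = False

# The whole file parsed at once is valid JSON
record = json.loads("\n".join(lines))
```

This is exactly what `rddj.map(lambda x: json.loads(x))` does element by element, which is why it raises the decode error on the very first line.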

Just use the JSON reader:

spark.read.json("test.json", multiLine=True)

Note that this returns a DataFrame; if you really need an RDD afterwards, you can get an RDD of Row objects via its .rdd attribute.

or (not recommended) read whole text files:

sc.wholeTextFiles("test.json").values().map(json.loads)
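As a plain-Python sketch of what that second pipeline does (the file path below is illustrative only): wholeTextFiles yields (path, full_file_content) pairs, .values() keeps just the content, and .map(json.loads) parses each file as one JSON document.

```python
import json

# wholeTextFiles produces (path, whole_file_content) pairs -- one per file.
# The path here is a made-up example.
pairs = [(
    "file:/tmp/test.json",
    '{"id": 1, "name": "A green door", "price": 12.50, "tags": ["home", "green"]}',
)]

# Equivalent of .values().map(json.loads) on that RDD
records = [json.loads(content) for _path, content in pairs]
```

This works because each element is the entire file, not a single line, but it is not recommended for large files: every file must fit in memory on a single executor.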
