
Reading a JSON file into an RDD (not a DataFrame) using PySpark

I have the following file, test.json:

{
    "id": 1,
    "name": "A green door",
    "price": 12.50,
    "tags": ["home", "green"]
}

I want to load this file into an RDD. This is what I tried:

import json

rddj = sc.textFile('test.json')
rdd_res = rddj.map(lambda x: json.loads(x))

I get the following error:

Expecting object: line 1 column 1 (char 0)

I don't completely understand what json.loads does.

How can I resolve this problem?

textFile reads the data line by line, and the individual lines of your input are not syntactically valid JSON documents, so json.loads fails on each one.
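To see why, here is a plain-Python reproduction (no Spark needed): parsing any single line of the pretty-printed file fails, while parsing the whole file at once succeeds.

```python
import json

# The file content as sc.textFile would deliver it: one element per line
lines = [
    '{',
    '    "id": 1,',
    '    "name": "A green door",',
    '    "price": 12.50,',
    '    "tags": ["home", "green"]',
    '}',
]

try:
    json.loads(lines[0])      # just "{" -- not a complete JSON document
    single_line_ok = True
except ValueError:            # json.JSONDecodeError subclasses ValueError
    single_line_ok = False

# The whole file parsed at once is valid JSON
record = json.loads("\n".join(lines))
```

This is exactly what `rddj.map(lambda x: json.loads(x))` does element by element, which is why it raises the decode error on the very first line.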

Just use the JSON reader:

spark.read.json("test.json", multiLine=True)

Note that this returns a DataFrame; if you really need an RDD afterwards, you can get an RDD of Row objects via its .rdd attribute.

or (not recommended) read whole text files:

sc.wholeTextFiles("test.json").values().map(json.loads)
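As a plain-Python sketch of what that second pipeline does (the file path below is illustrative only): wholeTextFiles yields (path, full_file_content) pairs, .values() keeps just the content, and .map(json.loads) parses each file as one JSON document.

```python
import json

# wholeTextFiles produces (path, whole_file_content) pairs -- one per file.
# The path here is a made-up example.
pairs = [(
    "file:/tmp/test.json",
    '{"id": 1, "name": "A green door", "price": 12.50, "tags": ["home", "green"]}',
)]

# Equivalent of .values().map(json.loads) on that RDD
records = [json.loads(content) for _path, content in pairs]
```

This works because each element is the entire file, not a single line, but it is not recommended for large files: every file must fit in memory on a single executor.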
