Read JSON in PySpark
I want to read a JSON file in PySpark, but the JSON file is in this format (without commas or square brackets):
{"id": 1, "name": "jhon"}
{"id": 2, "name": "bryan"}
{"id": 3, "name": "jane"}
Is there an easy way to read this JSON in PySpark?
I have already tried this code:
df= spark.read.option("multiline", "true").json("data.json")
df.write.parquet("data.parquet")
But it doesn't work: only the first line appears in the parquet file.
I just want to read this JSON file and save it as parquet...
Try reading it as a text file first, then parse each line into a JSON object:
import json

lines = spark.read.text("data.json")
# Parse each line into a Python dict
parsed_lines = lines.rdd.map(lambda row: json.loads(row[0]))
# Convert the parsed JSON objects --> a DataFrame
df = parsed_lines.toDF()
df.write.parquet("data.parquet")
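For what it's worth, the file in the question is newline-delimited JSON (the "JSON Lines" format): every line is a complete JSON document on its own. A quick stdlib-only sketch of what the `json.loads` step above does per line, using the question's sample data:

```python
import json

# The same sample data as in the question, one JSON object per line.
lines = [
    '{"id": 1, "name": "jhon"}',
    '{"id": 2, "name": "bryan"}',
    '{"id": 3, "name": "jane"}',
]

# Each line parses independently, just like the rdd.map(json.loads) step.
records = [json.loads(line) for line in lines]
print(records[0])  # {'id': 1, 'name': 'jhon'}
```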
Only the first line appears when reading your file because the multiline parameter is set to true, which tells Spark to treat the whole file as a single JSON document; in your file, each line is its own JSON object. If you set the multiline parameter to false, it will work as expected.
df= spark.read.option("multiline", "false").json("data.json")
df.show()
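As a side note, this behaviour is easy to reproduce with the standard library: a strict JSON parser treats the input as one document, and `raw_decode` stops after the first complete object, which is why only the first row survived with multiline set to true:

```python
import json

# Whole file contents as one string (the multiline=true view of the file).
text = '{"id": 1, "name": "jhon"}\n{"id": 2, "name": "bryan"}\n{"id": 3, "name": "jane"}\n'

# raw_decode parses a single JSON value and reports where it stopped.
obj, end = json.JSONDecoder().raw_decode(text)
print(obj)  # {'id': 1, 'name': 'jhon'} -- only the first object is consumed
```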
In case your JSON file had contained a JSON array, like
[
{"id": 1, "name": "jhon"},
{"id": 2, "name": "bryan"},
{"id": 3, "name": "jane"}
]
or
[
{
"id": 1,
"name": "jhon"
},
{
"id": 2,
"name": "bryan"
}
]
then setting the multiline parameter to true would work.
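In that layout the whole file really is a single JSON document (one top-level array), so a single parse yields every record at once. A stdlib-only sketch, using the array form shown above:

```python
import json

# A file containing one top-level JSON array (the multiline=true case).
text = '''[
    {"id": 1, "name": "jhon"},
    {"id": 2, "name": "bryan"},
    {"id": 3, "name": "jane"}
]'''

# The file is one JSON document, so one json.loads call returns all records.
records = json.loads(text)
print(len(records))  # 3
```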