Read JSON in PySpark
I want to read a JSON file in PySpark, but the JSON file is in this format (without commas or square brackets):
{"id": 1, "name": "jhon"}
{"id": 2, "name": "bryan"}
{"id": 3, "name": "jane"}
Is there an easy way to read this JSON in PySpark?
I have already tried this code:
df= spark.read.option("multiline", "true").json("data.json")
df.write.parquet("data.parquet")
But it doesn't work: only the first line appears in the parquet file.
I just want to read this JSON file and save it as parquet...
Try reading it as a text file first, then parse each line into a JSON object:
import json

lines = spark.read.text("data.json")
# Parse each line into a Python dict
parsed_lines = lines.rdd.map(lambda row: json.loads(row[0]))
# Convert the parsed JSON objects --> a DataFrame
df = parsed_lines.toDF()
df.write.parquet("data.parquet")
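For what it's worth, the file in the question is newline-delimited JSON (the "JSON Lines" format): every line is a complete JSON document on its own. A quick stdlib-only sketch of what the `json.loads` step above does per line, using the question's sample data:

```python
import json

# The same sample data as in the question, one JSON object per line.
lines = [
    '{"id": 1, "name": "jhon"}',
    '{"id": 2, "name": "bryan"}',
    '{"id": 3, "name": "jane"}',
]

# Each line parses independently, just like the rdd.map(json.loads) step.
records = [json.loads(line) for line in lines]
print(records[0])  # {'id': 1, 'name': 'jhon'}
```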
Only the first line appears when reading your file because the multiline parameter is set to true, which tells Spark to treat the whole file as a single JSON document; in your file, each line is its own JSON object. If you set the multiline parameter to false, it will work as expected.
df= spark.read.option("multiline", "false").json("data.json")
df.show()
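As a side note, this behaviour is easy to reproduce with the standard library: a strict JSON parser treats the input as one document, and `raw_decode` stops after the first complete object, which is why only the first row survived with multiline set to true:

```python
import json

# Whole file contents as one string (the multiline=true view of the file).
text = '{"id": 1, "name": "jhon"}\n{"id": 2, "name": "bryan"}\n{"id": 3, "name": "jane"}\n'

# raw_decode parses a single JSON value and reports where it stopped.
obj, end = json.JSONDecoder().raw_decode(text)
print(obj)  # {'id': 1, 'name': 'jhon'} -- only the first object is consumed
```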
In case your JSON file had contained a JSON array, like
[
{"id": 1, "name": "jhon"},
{"id": 2, "name": "bryan"},
{"id": 3, "name": "jane"}
]
or
[
{
"id": 1,
"name": "jhon"
},
{
"id": 2,
"name": "bryan"
}
]
then setting the multiline parameter to true would work.
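In that layout the whole file really is a single JSON document (one top-level array), so a single parse yields every record at once. A stdlib-only sketch, using the array form shown above:

```python
import json

# A file containing one top-level JSON array (the multiline=true case).
text = '''[
    {"id": 1, "name": "jhon"},
    {"id": 2, "name": "bryan"},
    {"id": 3, "name": "jane"}
]'''

# The file is one JSON document, so one json.loads call returns all records.
records = json.loads(text)
print(len(records))  # 3
```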