
Read JSON in PySpark

I want to read a JSON file in PySpark, but the file is in this format (no commas and no square brackets):

{"id": 1, "name": "jhon"}
{"id": 2, "name": "bryan"}
{"id": 3, "name": "jane"}
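This layout is the JSON Lines (NDJSON) format: each line is a complete, standalone JSON object. A minimal sketch with only Python's standard `json` module (the sample records are taken from the question) shows how such a file is meant to be parsed:

```python
import json

# Each line of a JSON Lines file is a complete JSON object on its own.
lines = [
    '{"id": 1, "name": "jhon"}',
    '{"id": 2, "name": "bryan"}',
    '{"id": 3, "name": "jane"}',
]

# Parse line by line -- there is no surrounding array to decode.
records = [json.loads(line) for line in lines]
print(records[1]["name"])  # bryan
```

Spark's JSON reader expects exactly this one-object-per-line layout by default.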

Is there an easy way to read this JSON in PySpark?

I have already tried this code:

df = spark.read.option("multiline", "true").json("data.json")
df.write.parquet("data.parquet")

But it doesn't work: only the first line appears in the Parquet file.

I just want to read this JSON file and save it as Parquet...

Try reading it as a text file first, then parse each line into a JSON object:

import json

# Read the file as plain text: one Row per line
lines = spark.read.text("data.json")

# Parse each line into a Python dict
parsed_lines = lines.rdd.map(lambda row: json.loads(row[0]))

# Convert the parsed JSON objects to a DataFrame
df = parsed_lines.toDF()
df.write.parquet("data.parquet")

Only the first line appears because the multiline option is set to true, which tells Spark to treat the whole file as a single JSON document; in your file, each line is its own JSON object. If you set multiline to false (the default), it will work as expected:

df = spark.read.option("multiline", "false").json("data.json")
df.show()

If, instead, your file had contained a JSON array, like

[
{"id": 1, "name": "jhon"},
{"id": 2, "name": "bryan"},
{"id": 3, "name": "jane"}
]

or

[
    {
        "id": 1, 
        "name": "jhon"
    },
    {
        "id": 2, 
        "name": "bryan"
    }
]

then setting the multiline option to true would work.
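To see why the multiline option must match the file's layout, here is a small illustration using only Python's standard `json` module (no Spark needed): a pretty-printed JSON array parses as one whole document, but not line by line.

```python
import json

# A pretty-printed JSON array: one document spanning several lines
array_doc = """[
    {"id": 1, "name": "jhon"},
    {"id": 2, "name": "bryan"}
]"""

# Parsing the whole document at once succeeds -- this mirrors what
# multiline=true does: the entire file is one JSON value.
records = json.loads(array_doc)
print(len(records))  # 2

# Parsing it line by line fails: the first line is just "[",
# which is not valid JSON on its own (the multiline=false view).
try:
    json.loads(array_doc.splitlines()[0])
    line_by_line_ok = True
except json.JSONDecodeError:
    line_by_line_ok = False
print(line_by_line_ok)  # False
```

The same logic applies in reverse to JSON Lines files, which is why multiline=true on the question's file yields only the first object.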

