Parsing multiline nested json in Spark 3 dataframe using pyspark
Read multiline json string using Python Spark dataframe
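I am using pyspark code in a Databricks notebook to read the contents of an API into a dataframe. I validated the JSON payload, and the string is in valid JSON format. My guess is that the error is caused by the multiline JSON string. The code below works fine with other JSON API payloads.
Spark version < 2.2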
import requests
user = "usr"
password = "aBc!23"
response = requests.get('https://myapi.com/allcolor', auth=(user, password))
jsondata = response.json()
from pyspark.sql import *
df = spark.read.json(sc.parallelize([jsondata]))
df.show()
JSON payload:
{
"colors": [
{
"color": "black",
"category": "hue",
"type": "primary",
"code": {
"rgba": [
255,
255,
255,
1
],
"hex": "#000"
}
},
{
"color": "white",
"category": "value",
"code": {
"rgba": [
0,
0,
0,
1
],
"hex": "#FFF"
}
},
{
"color": "red",
"category": "hue",
"type": "primary",
"code": {
"rgba": [
255,
0,
0,
1
],
"hex": "#FF0"
}
},
{
"color": "blue",
"category": "hue",
"type": "primary",
"code": {
"rgba": [
0,
0,
255,
1
],
"hex": "#00F"
}
},
{
"color": "yellow",
"category": "hue",
"type": "primary",
"code": {
"rgba": [
255,
255,
0,
1
],
"hex": "#FF0"
}
},
{
"color": "green",
"category": "hue",
"type": "secondary",
"code": {
"rgba": [
0,
255,
0,
1
],
"hex": "#0F0"
}
}
]
}
Error:
pyspark.sql.dataframe.DataFrame = [_corrupt_record: string]
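One way to see how a pretty-printed payload can end up as `_corrupt_record`: a reader that expects line-delimited JSON treats every physical line as its own record. The sketch below uses only Python's standard `json` module (not Spark) and a shortened, hypothetical sample payload to check how many lines parse as complete documents on their own. Whether this is what happens inside `spark.read.json` for this particular call is an assumption, in line with the guess made in the question.

```python
import json

# Shortened stand-in for the API payload (hypothetical sample, pretty-printed).
multiline = json.dumps(
    {"colors": [{"color": "black", "code": {"hex": "#000"}}]}, indent=2
)


def parses_alone(text):
    """Return True if the given text is a complete JSON document by itself."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


lines = multiline.splitlines()

# Lines that would survive if each physical line were read as one record.
valid_lines = [ln for ln in lines if parses_alone(ln)]

# The string as a whole is valid JSON even though its individual lines are not.
whole_ok = parses_alone(multiline)
```

Here the document is valid as a whole, yet none of its lines is valid alone, which is the situation a line-oriented reader cannot handle.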
Modified code:
spark.sql("set spark.databricks.delta.preview.enabled=true")
spark.sql("set spark.databricks.delta.retentionDurationCheck.preview.enabled=false")
import json
import requests
from requests.auth import HTTPDigestAuth
import pandas as pd
user = "username"
password = "password"
myResponse = requests.get('https://myapi.com/allcolor', auth=(user, password))
if(myResponse.ok):
    jData = json.loads(myResponse.content)
    s1 = json.dumps(jData)
    #load data from api
    x = json.loads(s1)
    data = pd.read_json(json.dumps(x))
    #create dataframe
    spark_df = spark.createDataFrame(data)
    spark_df.show()
    spark.conf.set("fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net","<your-storage-account-access-key>")
    spark_df.write.mode("overwrite").json("wasbs://<container>@<storage-account-name>.blob.core.windows.net/<directory>/")
else:
    myResponse.raise_for_status()
The output is not in the same format as the source.
Modified output (differs from the source):
{
"colors":
{
"color": "black",
"category": "hue",
"type": "primary",
"code": {
"rgba": [
255,
255,
255,
1
],
"hex": "#000"
}
}
}
{
"colors":
{
"color": "white",
"category": "value",
"code": {
"rgba": [
0,
0,
0,
1
],
"hex": "#FFF"
}
}
}
Could you point out where I am going wrong? The output file I store in ADLS Gen2 does not match the source API JSON payload.
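The difference in shape is consistent with how the modified code builds the dataframe: `pd.read_json` turns the top-level `colors` array into one row per element, and Spark's JSON writer emits one JSON object per row. The sketch below only emulates that shape with the standard `json` module. It does not call pandas or Spark, and the sample payload is a shortened, hypothetical stand-in for the real one.

```python
import json

# Shortened stand-in for the source payload (hypothetical two-element sample).
payload = {
    "colors": [
        {"color": "black", "category": "hue", "type": "primary"},
        {"color": "white", "category": "value"},
    ]
}

# Row-oriented view: one record per element of the "colors" array.
rows = [{"colors": item} for item in payload["colors"]]

# Line-delimited JSON: one object per line, like the output shown above.
json_lines = "\n".join(json.dumps(row) for row in rows)

# Regrouping the per-line objects recovers the original single document.
restored = {
    "colors": [json.loads(line)["colors"] for line in json_lines.splitlines()]
}
```

Keeping the whole payload as a single record, so that `colors` stays an array inside one row, would avoid the split in the first place.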
Remove the newlines before calling spark.read.json:
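# jsondata must be a JSON string here (e.g. response.text); response.json() returns a parsed object without .replace()
df = spark.read.json(sc.parallelize([jsondata.replace('\n','')]))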
df.show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|colors |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[hue, [#000, [255, 255, 255, 1]], black, primary], [value, [#FFF, [0, 0, 0, 1]], white,], [hue, [#FF0, [255, 0, 0, 1]], red, primary], [hue, [#00F, [0, 0, 255, 1]], blue, primary], [hue, [#FF0, [255, 255, 0, 1]], yellow, primary], [hue, [#0F0, [0, 255, 0, 1]], green, secondary]]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
df.printSchema()
root
|-- colors: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- category: string (nullable = true)
| | |-- code: struct (nullable = true)
| | | |-- hex: string (nullable = true)
| | | |-- rgba: array (nullable = true)
| | | | |-- element: long (containsNull = true)
| | |-- color: string (nullable = true)
| | |-- type: string (nullable = true)
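The `replace('\n','')` call needs `jsondata` to be a string, while the question's code obtains it from `response.json()`, which returns an already parsed object. A minimal sketch that covers both cases, using only the standard `json` module, is to serialise the payload back into a compact single-line document. The helper name `to_single_line` and the sample string are hypothetical, and the Spark call is shown only as a comment because it needs a live session.

```python
import json


def to_single_line(payload):
    """Return a compact one-line JSON string.

    Accepts either a JSON document as str/bytes or an already parsed object.
    """
    if isinstance(payload, (str, bytes)):
        payload = json.loads(payload)  # parsing also validates the document
    return json.dumps(payload, separators=(",", ":"))


# Hypothetical pretty-printed sample standing in for the API response text.
pretty = '{\n  "colors": [\n    {"color": "black", "code": {"hex": "#000"}}\n  ]\n}'
compact = to_single_line(pretty)

# In a Spark session the string could then be passed on, for example:
# df = spark.read.json(sc.parallelize([compact]))
```

Compared with a plain string replace, the round trip also removes the indentation and fails early if the payload is not valid JSON.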