Reading a multi-line JSON string with a Python Spark DataFrame

Question

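I am using PySpark code in a Databricks notebook to read the contents of an API into a DataFrame. I validated the JSON payload, and the string is valid JSON. I suspect the error is caused by the multi-line JSON string. The code below works fine with other JSON API payloads.

Spark version < 2.2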

import requests
user = "usr"
password = "aBc!23"
response = requests.get('https://myapi.com/allcolor', auth=(user, password))
jsondata = response.json()
from pyspark.sql import *
df = spark.read.json(sc.parallelize([jsondata]))
df.show()

JSON payload:

{
  "colors": [
    {
      "color": "black",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          255,
          255,
          255,
          1
        ],
        "hex": "#000"
      }
    },
    {
      "color": "white",
      "category": "value",
      "code": {
        "rgba": [
          0,
          0,
          0,
          1
        ],
        "hex": "#FFF"
      }
    },
    {
      "color": "red",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          255,
          0,
          0,
          1
        ],
        "hex": "#FF0"
      }
    },
    {
      "color": "blue",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          0,
          0,
          255,
          1
        ],
        "hex": "#00F"
      }
    },
    {
      "color": "yellow",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          255,
          255,
          0,
          1
        ],
        "hex": "#FF0"
      }
    },
    {
      "color": "green",
      "category": "hue",
      "type": "secondary",
      "code": {
        "rgba": [
          0,
          255,
          0,
          1
        ],
        "hex": "#0F0"
      }
    }
  ]
}

Error:

pyspark.sql.dataframe.DataFrame = [_corrupt_record: string]

Modified code:

spark.sql("set spark.databricks.delta.preview.enabled=true")
spark.sql("set spark.databricks.delta.retentionDurationCheck.preview.enabled=false")
import json
import requests
from requests.auth import HTTPDigestAuth
import pandas as pd
user = "username"
password = "password"
myResponse = requests.get('https://myapi.com/allcolor', auth=(user, password))
if(myResponse.ok):
  jData = json.loads(myResponse.content)
  s1 = json.dumps(jData)
  #load data from api
  x = json.loads(s1)
  data = pd.read_json(json.dumps(x))
  #create dataframe
  spark_df = spark.createDataFrame(data)
  spark_df.show()          
  spark.conf.set("fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net","<your-storage-account-access-key>")
  spark_df.write.mode("overwrite").json("wasbs://<container>@<storage-account-name>.blob.core.windows.net/<directory>/")
else:
  myResponse.raise_for_status()

The output is not in the same format as the source.

Output of the modified code (differs from the source):

{
  "colors": 
    {
      "color": "black",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          255,
          255,
          255,
          1
        ],
        "hex": "#000"
      }
    }
    }
{
  "colors":     
    {
      "color": "white",
      "category": "value",
      "code": {
        "rgba": [
          0,
          0,
          0,
          1
        ],
        "hex": "#FFF"
      }
    }
    }

Could you point out where I am going wrong? The output file I store in ADLS Gen2 does not match the source API JSON payload.

Answer 1

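Remove the newlines before calling spark.read.json. Note that replace is a string method, so this assumes jsondata holds the raw response text (for example response.text) rather than the dict returned by response.json():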

df = spark.read.json(sc.parallelize([jsondata.replace('\n','')]))
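
If jsondata may already be a parsed object, a variant of the same idea is to re-serialize it with json.dumps, which emits a single line by default. The following is only a minimal sketch: the helper name to_single_line_json and the shortened sample payload are assumptions made for illustration, and the Spark call is left as a comment because it needs a notebook where spark and sc exist.

```python
import json

def to_single_line_json(payload):
    """Return compact one-line JSON text for JSON text or an already-parsed object."""
    # Hypothetical helper: parse first if we were handed raw text or bytes.
    if isinstance(payload, (str, bytes)):
        payload = json.loads(payload)
    # json.dumps without `indent` never inserts newlines between tokens.
    return json.dumps(payload, separators=(",", ":"))

# Shortened, made-up sample in the same shape as the payload in the question.
multiline = """{
  "colors": [
    {"color": "black", "category": "hue"}
  ]
}"""

one_line = to_single_line_json(multiline)
print(one_line)

# In a notebook where `spark` and `sc` are defined, the string could then be read with:
# df = spark.read.json(sc.parallelize([one_line]))
```

Round-tripping through the json module also validates the payload, so malformed input fails early with a json.JSONDecodeError instead of surfacing later as _corrupt_record.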

df.show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|colors                                                                                                                                                                                                                                                                                  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[hue, [#000, [255, 255, 255, 1]], black, primary], [value, [#FFF, [0, 0, 0, 1]], white,], [hue, [#FF0, [255, 0, 0, 1]], red, primary], [hue, [#00F, [0, 0, 255, 1]], blue, primary], [hue, [#FF0, [255, 255, 0, 1]], yellow, primary], [hue, [#0F0, [0, 255, 0, 1]], green, secondary]]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

df.printSchema()
root
 |-- colors: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- code: struct (nullable = true)
 |    |    |    |-- hex: string (nullable = true)
 |    |    |    |-- rgba: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |-- color: string (nullable = true)
 |    |    |-- type: string (nullable = true)
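
On the follow-up about the file stored in ADLS Gen2: the Spark JSON writer emits one JSON object per row (JSON Lines), and pd.read_json on this payload most likely yields one row per element of colors, which would explain why each output record wraps a single color. If a single nested document is needed, the rows can be merged back together. The sketch below is pure Python and only illustrates the idea; the function name rows_to_document and the two sample rows are assumptions, not part of the original post.

```python
import json

def rows_to_document(json_lines, key="colors"):
    """Merge JSON Lines rows shaped like {key: item} back into {key: [item, ...]}."""
    # Hypothetical helper: skip blank lines, parse each row, and collect the wrapped item.
    items = [
        json.loads(line)[key]
        for line in json_lines.splitlines()
        if line.strip()
    ]
    return {key: items}

# Two made-up rows in the shape the writer produces: one JSON object per line.
written = (
    '{"colors": {"color": "black", "category": "hue"}}\n'
    '{"colors": {"color": "white", "category": "value"}}\n'
)

document = rows_to_document(written)
print(json.dumps(document, indent=2))
```

Reading the payload with spark.read.json as shown in the answer keeps colors as a single array column, so that route avoids the per-element split in the first place.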

Solution 1
1 2021-03-10 06:42:07
