Read JSON and load it into a PySpark DataFrame

I want to ingest data from Azure Cosmos DB. I am using the Python SDK for the connection in Databricks.

I want to load the result of json.loads(data) into a PySpark DataFrame, because I need to save the data in Databricks Delta Lake. How can I read this data into a PySpark DataFrame? Below are my sample data and code:

{
 "appUuid": "aaaa-bbbb-cccc",
 "SystemId": null,
 "city": "Lancaster",
 "state": "NY",
 "zipCode": "140",
 "field1": "others",
 "field2": "others"
}
{
 "appUuid": "bbbb-dddd-eeee",
 "SystemId": null,
 "city": "Alden ",
 "state": "NY",
 "zipCode": "140",
 "field1": "others",
 "field2": "others"
}
from azure.cosmos import CosmosClient

client = CosmosClient('https://<cosmos_client>.documents.azure.com:443/', credential='AccountKey')
DATABASE_NAME = 'TestDB'
database = client.get_database_client(DATABASE_NAME)
CONTAINER_NAME = 'Test'
container = database.get_container_client(CONTAINER_NAME)

import json
for item in container.query_items(
         query='SELECT Top 10 * FROM Test',
        enable_cross_partition_query=True):
    data = json.dumps(item, indent=True)
    print(data)
    print(type(data))

# converting string to json dict
data1 = json.loads(data)
print(data1)
print(type(data1))


from pyspark.sql import *
from pyspark.sql import SparkSession,Row
spark = SparkSession.builder.getOrCreate()

df = spark.read.json(data1)  # I am getting an error on this line.
display(df)

I am getting this error:

"IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: {"

Try writing it to the Databricks file system (DBFS) first and reading it from there:

# data here is the initial JSON string (the output of json.dumps)
dbutils.fs.put("/tmp/data.json", data, True)
df = spark.read.option("multiline", True).json("/tmp/data.json")
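Note that the loop in the question overwrites `data` on every iteration, so the file written above would hold only the last item. A minimal sketch, assuming the goal is to load every queried item: serialize the items as JSON Lines (one object per line) and write that text instead. The helper name `items_to_json_lines` is hypothetical, and `sample_items` stands in for the results of `container.query_items(...)`.

```python
import json


def items_to_json_lines(items):
    """Join dicts into JSON Lines text: one compact JSON object per line."""
    return "\n".join(json.dumps(item) for item in items)


# Stand-in for the documents returned by container.query_items(...)
sample_items = [
    {"appUuid": "aaaa-bbbb-cccc", "SystemId": None, "city": "Lancaster", "state": "NY"},
    {"appUuid": "bbbb-dddd-eeee", "SystemId": None, "city": "Alden ", "state": "NY"},
]

payload = items_to_json_lines(sample_items)

# In a Databricks notebook, where dbutils and spark are predefined:
# dbutils.fs.put("/tmp/data.json", payload, True)
# df = spark.read.json("/tmp/data.json")
```

Because each record sits on its own line, the multiline option should not be needed when reading this file.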

You're trying to pass a JSON string as the path to the data, which doesn't work. The .json() function accepts either a file path or an RDD of strings. To create an RDD, use the following code:

# data is the JSON string from json.dumps; the RDD must hold strings, not the parsed dict data1
rdd = sc.parallelize([data])
df = spark.read.json(rdd)
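The RDD approach extends to all queried items, not just one. A minimal sketch, assuming each Cosmos DB item is a plain dict that `json.dumps` can serialize: turn every item into its own JSON string, then parallelize the list. The helper names `items_to_json_strings` and `items_to_dataframe` are hypothetical, and `sample_items` stands in for the results of `container.query_items(...)`.

```python
import json


def items_to_json_strings(items):
    """Serialize each dict to its own single-line JSON string."""
    return [json.dumps(item) for item in items]


def items_to_dataframe(spark, items):
    """Build a DataFrame from an iterable of dicts via an RDD of JSON strings."""
    rdd = spark.sparkContext.parallelize(items_to_json_strings(items))
    return spark.read.json(rdd)


# Stand-in for the documents returned by container.query_items(...)
sample_items = [
    {"appUuid": "aaaa-bbbb-cccc", "SystemId": None, "city": "Lancaster", "state": "NY"},
    {"appUuid": "bbbb-dddd-eeee", "SystemId": None, "city": "Alden ", "state": "NY"},
]

json_strings = items_to_json_strings(sample_items)
```

In a Databricks notebook, where `spark` is already defined, this could be called as `df = items_to_dataframe(spark, container.query_items(query='SELECT Top 10 * FROM Test', enable_cross_partition_query=True))`.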
