Read json.load data into a PySpark dataframe
I want to ingest data from Azure Cosmos DB; I am using the Python SDK for the connection in Databricks.
I want to be able to save my json.load(data) into a PySpark dataframe, as I need to store the data in Databricks Delta Lake. How can I read this data into a PySpark dataframe? Below is my code and sample data.
{
"appUuid": "aaaa-bbbb-cccc",
"SystemId": null,
"city": "Lancaster",
"state": "NY",
"zipCode": "140",
"field1": "others",
"field2": "others"
}
{
"appUuid": "bbbb-dddd-eeee",
"SystemId": null,
"city": "Alden ",
"state": "NY",
"zipCode": "140",
"field1": "others",
"field2": "others"
}
from azure.cosmos import CosmosClient
client = CosmosClient('https://<cosmos_client>.documents.azure.com:443/', credential='AccountKey')
DATABASE_NAME = 'TestDB'
database = client.get_database_client(DATABASE_NAME)
CONTAINER_NAME = 'Test'
container = database.get_container_client(CONTAINER_NAME)
import json

for item in container.query_items(
        query='SELECT Top 10 * FROM Test',
        enable_cross_partition_query=True):
    data = json.dumps(item, indent=True)
    print(data)
    print(type(data))
    # converting the JSON string back to a dict
    data1 = json.loads(data)
    print(data1)
    print(type(data1))

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json(data1)  # I am getting the error on this line
display(df)
I am getting this error:
"IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: {"
Try writing it into the Databricks file system first:
# `data` here is the initial JSON string
dbutils.fs.put("/tmp/data.json", data, True)
df = spark.read.option("multiline", True).json("/tmp/data.json")
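One detail worth noting with this approach: if each query result is written as one JSON document per line (JSON Lines), Spark's default reader handles the file without the multiline option. A minimal sketch, using a hypothetical list of documents in place of the real Cosmos DB results from the question:

```python
import json

# Hypothetical stand-in for the documents returned by container.query_items()
# in the question; only the shape matters here.
items = [
    {"appUuid": "aaaa-bbbb-cccc", "city": "Lancaster", "state": "NY"},
    {"appUuid": "bbbb-dddd-eeee", "city": "Alden", "state": "NY"},
]

# One JSON document per line (JSON Lines) -- the layout Spark's default
# json reader expects, so "multiline" becomes unnecessary.
json_lines = "\n".join(json.dumps(item) for item in items)

# Inside a Databricks notebook, where dbutils and spark are predefined:
# dbutils.fs.put("/tmp/data.json", json_lines, True)
# df = spark.read.json("/tmp/data.json")
```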
You're trying to pass a JSON string as a path to the data - it doesn't work that way. The
.json()
function accepts either a file path or an RDD of strings. To create an RDD, use the following code:
rdd = sc.parallelize([data1])
df = spark.read.json(rdd)
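Note also that the loop in the question overwrites data1 on every iteration, so only the last document would reach the dataframe. A sketch of collecting all results first and parallelizing once (the items list here is a hypothetical stand-in for the real query output):

```python
import json

# Hypothetical stand-in for the query results; in the real notebook these
# would come from container.query_items(...) above.
items = [
    {"appUuid": "aaaa-bbbb-cccc", "city": "Lancaster"},
    {"appUuid": "bbbb-dddd-eeee", "city": "Alden"},
]

# Serialize every document to a JSON string so the whole result set becomes
# a single RDD of strings, instead of keeping only the final loop item.
json_strings = [json.dumps(item) for item in items]

# Inside Databricks, where `spark` and `sc` are predefined:
# rdd = sc.parallelize(json_strings)
# df = spark.read.json(rdd)
```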