
Pyspark: reading multiple JSON files from path

I am trying to read 20 JSON files from a path and create a DataFrame, but even though the schema gets created, the DataFrame only contains null values ("Query returned no results"). Here is my code:

ordersJsonPath = "dbfs:/user/xyz/dbacademy/raw/orders/stream" 
ordersDF = spark.read.schema(userDefinedSchema).json(ordersJsonPath)
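
(Note: userDefinedSchema is not shown in the question; the following is a minimal sketch of how it might be declared, with hypothetical field names and types.)

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Hypothetical schema -- the actual fields of the orders JSON are not shown in the question
userDefinedSchema = StructType([
    StructField("order_id", StringType(), True),
    StructField("customer_id", StringType(), True),
    StructField("order_timestamp", TimestampType(), True),
])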

When I run the code with one specific JSON file, it works:

"dbfs:/user/xyz/dbacademy/raw/orders/stream/order_0612a18b-0cc7-43ea-9f5b-155aad967cb9_2020-01-01.json"

From what I understood, I need to create a schema manually when working with JSON. Or am I confusing things here, and is a manual schema unnecessary when working with multiple files?
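
(For reference: in a plain batch read, an explicit schema is optional, since spark.read.json can infer one by scanning the files. A minimal sketch, reusing the ordersJsonPath variable from above:)

# Schema inference for a batch read -- Spark scans the JSON files to derive the schema
ordersInferredDF = spark.read.json(ordersJsonPath)
ordersInferredDF.printSchema()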

Thank you in advance!

Edit: I checked the schema of the first 5 files in the path; they are all the same, and they can all be read by the query I wrote (not streamed, though, as that requires giving the whole path as input). When I start the readStream query, not even these first 5 JSON files get processed. Just nothing happens.
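
(The actual readStream query is not shown; a minimal sketch of what it presumably looks like, assuming the same schema and path. For file-based sources, Structured Streaming requires an explicit schema unless spark.sql.streaming.schemaInference is enabled.)

# Hypothetical streaming read over the same directory
ordersStreamDF = (spark.readStream
                  .schema(userDefinedSchema)  # file sources need an explicit schema when streaming
                  .json(ordersJsonPath))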

Another edit: I got it solved. I needed to add /* to the path.
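
(Based on that edit, the fix presumably amounts to appending a glob to the directory path, roughly:)

ordersJsonPath = "dbfs:/user/xyz/dbacademy/raw/orders/stream/*"
ordersDF = spark.read.schema(userDefinedSchema).json(ordersJsonPath)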

