
Pyspark: reading multiple JSON files from path

I am trying to read 20 JSON files from a path and create a DataFrame, but even though the schema gets created, the DataFrame only contains null values ("Query returned no results"). Here is my code:

ordersJsonPath = "dbfs:/user/xyz/dbacademy/raw/orders/stream" 
ordersDF = spark.read.schema(userDefinedSchema).json(ordersJsonPath)
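
(Note: userDefinedSchema is not shown in the question; the following is a minimal sketch of how it might be declared, with hypothetical field names and types.)

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Hypothetical schema -- the actual fields of the orders JSON are not shown in the question
userDefinedSchema = StructType([
    StructField("order_id", StringType(), True),
    StructField("customer_id", StringType(), True),
    StructField("order_timestamp", TimestampType(), True),
])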

When I run the code with one specific JSON file, it works:

"dbfs:/user/xyz/dbacademy/raw/orders/stream/order_0612a18b-0cc7-43ea-9f5b-155aad967cb9_2020-01-01.json"

From what I understood, I need to create a schema manually when working with JSON. Or am I confusing things here, and is a manual schema unnecessary when working with multiple files?
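
(For reference: in a plain batch read, an explicit schema is optional, since spark.read.json can infer one by scanning the files. A minimal sketch, reusing the ordersJsonPath variable from above:)

# Schema inference for a batch read -- Spark scans the JSON files to derive the schema
ordersInferredDF = spark.read.json(ordersJsonPath)
ordersInferredDF.printSchema()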

Thank you in advance!

Edit: I checked the schema of the first 5 files in the path; they are all the same, and they can all be read by the query I wrote (not streamed, though, as that requires giving the whole path as input). When I start the readStream query, not even these first 5 JSON files get processed. Just nothing happens.
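
(The actual readStream query is not shown; a minimal sketch of what it presumably looks like, assuming the same schema and path. For file-based sources, Structured Streaming requires an explicit schema unless spark.sql.streaming.schemaInference is enabled.)

# Hypothetical streaming read over the same directory
ordersStreamDF = (spark.readStream
                  .schema(userDefinedSchema)  # file sources need an explicit schema when streaming
                  .json(ordersJsonPath))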

Another edit: I got it solved. I needed to add /* to the path.
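
(Based on that edit, the fix presumably amounts to appending a glob to the directory path, roughly:)

ordersJsonPath = "dbfs:/user/xyz/dbacademy/raw/orders/stream/*"
ordersDF = spark.read.schema(userDefinedSchema).json(ordersJsonPath)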

