

How do I read in all files, including subfolders, when streaming from a folder using Spark Streaming in Scala?

I have some files that I want to stream using Spark Structured Streaming. The structure is something like this:

myFolder
├── subFolderOne
│   ├── fileOne.gz
│   ├── fileTwo.gz
│   └── fileThree.gz
└── subFolderTwo
    ├── fileFour.gz
    ├── fileFive.gz
    └── fileSix.gz

When I only do the following, it works:

val df = spark
  .readStream
  .format("json")
  .schema(schema)
  .option("maxFilesPerTrigger", 1)
  .json("/myFolder/subFolderOne/")     <-------

But I want to read from the root level, /myFolder/, so that it picks up all files within any number of subfolders. Is this possible?

I am using Spark 2.4.5 and Scala 2.11.6.

So, it turns out to be as simple as this:

Before:

.json("/myFolder/") 

After:

.json("/myFolder/*") 
