

How do I read in all files, including subfolders, when streaming from a folder using Spark Streaming in Scala?

I have some files that I want to stream using Spark Structured Streaming. The structure is something like this:

myFolder
├── subFolderOne
│   ├── fileOne.gz
│   ├── fileTwo.gz
│   └── fileThree.gz
└── subFolderTwo
    ├── fileFour.gz
    ├── fileFive.gz
    └── fileSix.gz

When I only do the following, it works:

val df = spark
  .readStream
  .format("json")
  .schema(schema)
  .option("maxFilesPerTrigger", 1)
  .json("/myFolder/subFolderOne/")     <-------

But I want to read from the root level, /myFolder/, so that it picks up all files within any number of subfolders. Is this possible?

I am using Spark 2.4.5 and Scala 2.11.6.

So, it turns out to be as simple as this:

Before:

.json("/myFolder/") 

After:

.json("/myFolder/*") 
