How do I read in all files including subfolders when streaming from folder using spark streaming in scala?
I have some files that I want to stream using Spark Structured Streaming. The structure is something like this:
myFolder
├── subFolderOne
│   ├── fileOne.gz
│   ├── fileTwo.gz
│   └── fileThree.gz
└── subFolderTwo
    ├── fileFour.gz
    ├── fileFive.gz
    └── fileSix.gz
When I only do the following, it works:
val df = spark
  .readStream
  .format("json")
  .schema(schema)
  .option("maxFilesPerTrigger", 1)
  .json("/myFolder/subFolderOne/") // <-------
But I want to read at the root level, /myFolder/, so that it picks up all files within any number of subfolders. Is this possible?
I am using Spark 2.4.5 and Scala 2.11.6.
So, it turns out to be as simple as this:

Before:

.json("/myFolder/")

After:

.json("/myFolder/*")
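Put together, the working reader with the glob path might look like the following sketch. The SparkSession setup and the schema fields (`id`, `value`) are assumptions for illustration, not part of the original question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Assumed setup: a local SparkSession (hypothetical app name).
val spark = SparkSession.builder()
  .appName("stream-subfolders")
  .master("local[*]")
  .getOrCreate()

// Hypothetical schema for the JSON records inside the .gz files.
val schema = new StructType()
  .add("id", StringType)
  .add("value", DoubleType)

// The trailing "/*" glob makes the file source list paths one level down,
// i.e. the files inside every subfolder of /myFolder/.
val df = spark
  .readStream
  .schema(schema)
  .option("maxFilesPerTrigger", 1)
  .json("/myFolder/*")
```

Note that the glob matches one directory level; for arbitrarily nested folders on Spark 3.0+, the file source also accepts `.option("recursiveFileLookup", "true")`, which was not yet available on the asker's Spark 2.4.5.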