How to process files using Spark Structured Streaming chunk by chunk?

I am processing a large number of files, and I want to process them chunk by chunk. Let's say that during each batch, I want to process 50 files separately.

How can I do it using Spark Structured Streaming?

I have seen that Jacek Laskowski ( https://stackoverflow.com/users/1305344/jacek-laskowski ) said in a similar question ( Spark to process rdd chunk by chunk from json files and post to Kafka topic ) that it was possible using Spark Structured Streaming, but I can't find any examples of it.

Thanks a lot,

If using the File Source:

maxFilesPerTrigger: maximum number of new files to be considered in every trigger (default: no max)

spark
  .readStream
  .format("json")
  .schema(jsonSchema)                // streaming file sources need an explicit schema; jsonSchema is a placeholder
  .option("maxFilesPerTrigger", 50)  // pick up at most 50 new files per trigger
  .load("/path/to/files")            // DataStreamReader has no .path(); the directory goes to load()
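If each chunk of (at most) 50 files needs its own handling, one common pattern is foreachBatch, which hands you each micro-batch as a plain DataFrame together with its batch id. The following is only a sketch on top of the stream above, not part of the original answer: spark is an existing SparkSession, and jsonSchema, filesStream and the processing body are placeholders.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._

// Placeholder schema: replace with the real structure of your JSON files
val jsonSchema = new StructType()
  .add("id", LongType)
  .add("payload", StringType)

val filesStream: DataFrame =
  spark
    .readStream
    .format("json")
    .schema(jsonSchema)
    .option("maxFilesPerTrigger", 50) // at most 50 new files per micro-batch
    .load("/path/to/files")

filesStream
  .writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // batchDF holds the rows read from the (at most) 50 files of this trigger
    batchDF.persist()
    // ... per-chunk processing goes here ...
    batchDF.unpersist()
    ()                                // foreachBatch expects Unit
  }
  .start()
  .awaitTermination()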

If using a Kafka Source it would be similar, but with the maxOffsetsPerTrigger option.
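For reference, a minimal sketch of the Kafka variant; the broker address and topic name below are assumptions for illustration. maxOffsetsPerTrigger caps the number of records (offsets) read per trigger rather than the number of files.

spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
  .option("subscribe", "events")                       // assumed topic name
  .option("maxOffsetsPerTrigger", 10000)               // cap records read per trigger
  .load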
