How to process files using Spark Structured Streaming chunk by chunk?

I am processing a large number of files, and I want to process them chunk by chunk. Let's say that during each batch, I want to process 50 files separately.

How can I do it using Spark Structured Streaming?

I have seen that Jacek Laskowski ( https://stackoverflow.com/users/1305344/jacek-laskowski ) said in a similar question ( Spark to process rdd chunk by chunk from json files and post to Kafka topic ) that it was possible using Spark Structured Streaming, but I can't find any examples of it.

Thanks a lot,

If using the File Source:

maxFilesPerTrigger: maximum number of new files to be considered in every trigger (default: no max)

spark
  .readStream
  .format("json")
  .schema(jsonSchema)                // streaming file sources need an explicit schema; jsonSchema is a placeholder
  .option("maxFilesPerTrigger", 50)  // pick up at most 50 new files per trigger
  .load("/path/to/files")            // DataStreamReader has no .path(); the directory goes to load()
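If each chunk of (at most) 50 files needs its own handling, one common pattern is foreachBatch, which hands you each micro-batch as a plain DataFrame together with its batch id. The following is only a sketch on top of the stream above, not part of the original answer: spark is an existing SparkSession, and jsonSchema, filesStream and the processing body are placeholders.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._

// Placeholder schema: replace with the real structure of your JSON files
val jsonSchema = new StructType()
  .add("id", LongType)
  .add("payload", StringType)

val filesStream: DataFrame =
  spark
    .readStream
    .format("json")
    .schema(jsonSchema)
    .option("maxFilesPerTrigger", 50) // at most 50 new files per micro-batch
    .load("/path/to/files")

filesStream
  .writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // batchDF holds the rows read from the (at most) 50 files of this trigger
    batchDF.persist()
    // ... per-chunk processing goes here ...
    batchDF.unpersist()
    ()                                // foreachBatch expects Unit
  }
  .start()
  .awaitTermination()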

If using a Kafka Source it would be similar, but with the maxOffsetsPerTrigger option.
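For reference, a minimal sketch of the Kafka variant; the broker address and topic name below are assumptions for illustration. maxOffsetsPerTrigger caps the number of records (offsets) read per trigger rather than the number of files.

spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
  .option("subscribe", "events")                       // assumed topic name
  .option("maxOffsetsPerTrigger", 10000)               // cap records read per trigger
  .load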
