
Parquet File Output Sink - Spark Structured Streaming

Wondering what (and how to modify) triggers a Spark Structured Streaming query (with a Parquet file output sink configured) to write data to the Parquet files. I periodically feed the stream input data (using a stream reader to read in files), but it does not write output to a Parquet file for each file provided as input. Once I have given it a few files, it tends to write a Parquet file just fine.

I am wondering how to control this. I would like to be able to force a new write to a Parquet file for every new file provided as input. Any tips appreciated!

Note: I have maxFilesPerTrigger set to 1 on the read stream call. I also see the streaming query process the single input file; however, a single file on input does not appear to result in the streaming query writing the output to the Parquet file.
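For reference, a file-source read stream with maxFilesPerTrigger set as described might look like the following minimal sketch (the input path, schema, and field names here are hypothetical placeholders, not from the original question):

```python
# Sketch: a file-based read stream limited to one new file per micro-batch.
# The schema and "/data/incoming" path are hypothetical examples.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("parquet-sink-demo").getOrCreate()

# File sources require an explicit schema for streaming reads.
schema = StructType([
    StructField("id", StringType()),
    StructField("eventTime", TimestampType()),
])

stream = (
    spark.readStream
    .schema(schema)
    .option("maxFilesPerTrigger", 1)   # pick up at most one new file per trigger
    .json("/data/incoming")            # hypothetical input directory
)
```

With this option set, each micro-batch consumes a single new input file, which is why the question expects one output write per input file.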

After further analysis, and after working with the ForEach output sink using the default Append mode, I believe the issue I was running into was the combination of Append mode with the watermarking feature.

After re-reading https://spark.apache.org/docs/2.2.1/structured-streaming-programming-guide.html#starting-streaming-queries it appears that when Append mode is used with a watermark set, Spark Structured Streaming will not write out aggregation results to the result table until the watermark time limit has passed. Append mode does not allow updates to records, so it must wait for the watermark to pass to ensure the row will no longer change.
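To illustrate the behavior described above, here is a sketch of a windowed aggregation with a watermark, written to a Parquet sink in Append mode (the window sizes, column names, and paths are hypothetical, and `stream` stands for the streaming DataFrame read in earlier):

```python
# Sketch: windowed aggregation with a watermark, written in Append mode.
# In Append mode, a window's result row is emitted only after the watermark
# (max eventTime seen minus 10 minutes) has moved past the end of the window,
# which is why output appears delayed relative to the input files.
from pyspark.sql.functions import window, count

agg = (
    stream
    .withWatermark("eventTime", "10 minutes")
    .groupBy(window("eventTime", "5 minutes"))
    .agg(count("*").alias("n"))
)

query = (
    agg.writeStream
    .outputMode("append")                        # rows held back until watermark passes
    .format("parquet")
    .option("path", "/data/out")                 # hypothetical output directory
    .option("checkpointLocation", "/data/chk")   # required for file sinks
    .start()
)
```

Feeding a single file advances the watermark only as far as that file's event times allow, so several files may be needed before any window is finalized and written out, matching the behavior observed in the question.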

I believe the Parquet file sink does not allow Update mode; however, after switching to the ForEach output sink and using Update mode, I observed data coming out of the sink as I expected. Essentially, for each record in, at least one record came out, with no delay (as was observed before).

Hopefully this is helpful to others.希望这对其他人有帮助。

