Apache Beam pipeline ingesting "Big" input file (more than 1GB) doesn't create any output file

Regarding the dataflow model of computation, I'm doing a PoC to test a few concepts using Apache Beam with the direct runner (and the Java SDK). I'm having trouble creating a pipeline that reads a "big" CSV file (about 1.25GB) and dumps it into an output file without any particular transformation, as in the following code (I'm mainly concerned with testing IO bottlenecks using this dataflow/beam model, because that's of primary importance for me):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

// Example 1: read a big CSV file and write it back out unchanged
Pipeline pipeline = Pipeline.create();
PCollection<String> output = pipeline
    .apply(TextIO.read().from("BIG_CSV_FILE"));
output.apply(
    TextIO
        .write()
        .to("BIG_OUTPUT")
        .withSuffix("csv").withNumShards(1));
pipeline.run();

The problem I'm having is that only smaller files work; when the big file is used, no output file is generated (and no error/exception is shown either, which makes debugging harder).

I'm aware that on the Direct Runner page of the Apache Beam project ( https://beam.apache.org/documentation/runners/direct/ ), it is explicitly stated under the memory considerations:

Local execution is limited by the memory available in your local environment. It is highly recommended that you run your pipeline with data sets small enough to fit in local memory. You can create a small in-memory data set using a Create transform, or you can use a Read transform to work with small local or remote files.

The above suggests I'm having a memory problem (but sadly that isn't being explicitly stated on the console, so I'm left wondering). I'm also concerned by their suggestion that the dataset should fit into memory (why isn't it reading from the file in parts instead of fitting the whole file/dataset into memory?).

A second consideration I'd like to add to this conversation (in case this is indeed a memory problem): how basic is the implementation of the direct runner? It isn't hard to implement a piece of code that reads a big file in chunks and also outputs to a new file (also in chunks), so that at no point does memory usage become a problem (because neither file is completely loaded into memory, only the current "chunk"). Even if the direct runner is more of a prototyping runner to test semantics, would it be too much to expect it to deal nicely with huge files? Consider that this is a unified model built from the ground up to deal with streaming, where window sizes are arbitrary and huge data accumulation/aggregation before sinking is a standard use case.
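To make concrete what I mean by "chunk by chunk": a plain-Java sketch (outside of Beam; the ChunkedCopy class name and the file names are placeholders) of a copy that never holds more than the current line plus the I/O buffers in memory:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ChunkedCopy {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("BIG_CSV_FILE"));
             BufferedWriter writer = Files.newBufferedWriter(Paths.get("BIG_OUTPUT.csv"))) {
            String line;
            // Stream line by line; neither file is ever fully loaded into memory.
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine();
            }
        }
    }
}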

So, more than a question, I'd deeply appreciate your feedback/comments on any of these points: have you noticed IO constraints using the direct runner? Am I overlooking some aspect, or is the direct runner really implemented so naively? Have you verified that this constraint disappears when using a proper production runner like Flink/Spark/Google Cloud Dataflow?

I'll eventually test with other runners like Flink or Spark, but it feels underwhelming that the direct runner (even if it is intended only for prototyping purposes) is having trouble with this first test, considering that the whole dataflow idea is based around ingesting, processing, grouping and distributing huge amounts of data under the umbrella of a unified batch/streaming model.


EDIT (to reflect Kenn's feedback): Kenn, thanks for those valuable points and feedback; they have been of great help in pointing me towards relevant documentation. Following your suggestion, I found out by profiling the application that the problem is indeed a Java heap related one (which somehow is never shown on the normal console, and is only seen in the profiler). Even though the file is "only" 1.25GB in size, internal usage goes beyond 4GB before the heap is dumped, suggesting the direct runner isn't working chunk by chunk but is indeed loading everything into memory (as the documentation says).

Regarding your points:

1 - I believe that serialization and shuffling can very well still be achieved through a "chunk by chunk" implementation. Maybe I had a false expectation of what the direct runner should be capable of, or I didn't fully grasp its intended reach; for now I'll refrain from doing non-functional tests while using the direct runner.

2 - Regarding sharding: I believe withNumShards controls the parallelism (and the number of output files) at the write stage (processing before that should still be fully parallel, and only at write time will it use as many workers, and generate as many files, as explicitly provided). Two reasons to believe this: first, the CPU profiler always shows 8 busy "direct-runner-workers" (mirroring the number of logical cores my PC has), independently of whether I set 1 shard or N shards. The second reason is what I understand from the documentation here ( https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/io/WriteFiles.html ):

By default, every bundle in the input PCollection will be processed by a FileBasedSink.WriteOperation, so the number of outputs will vary based on runner behavior, though at least 1 output will always be produced. The exact parallelism of the write stage can be controlled using withNumShards(int), typically used to control how many files are produced or to globally limit the number of workers connecting to an external service. However, this option can often hurt performance: it adds an additional GroupByKey to the pipeline.

One interesting thing here is that the "additional GroupByKey added to the pipeline" is undesirable in my use case (I only want the results in 1 file, without any regard for order or grouping), so probably adding an extra "flatten files" step after the N sharded output files have been generated is a better approach (sketched below).
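A rough sketch of that merge step, using plain Java NIO outside of Beam. The MergeShards class name, the BIG_OUTPUT_MERGED.csv target and the BIG_OUTPUT-* glob are placeholders; adjust the glob to however TextIO actually named the shards on disk:

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MergeShards {
    public static void main(String[] args) throws IOException {
        Path outputDir = Paths.get(".");                          // directory where the shards were written
        Path merged = outputDir.resolve("BIG_OUTPUT_MERGED.csv"); // final single file
        try (OutputStream out = Files.newOutputStream(merged);
             DirectoryStream<Path> shards = Files.newDirectoryStream(outputDir, "BIG_OUTPUT-*")) {
            for (Path shard : shards) {
                // Append each shard's bytes; iteration order is not guaranteed,
                // which is fine here since order/grouping doesn't matter.
                Files.copy(shard, out);
            }
        }
    }
}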

3 - Your suggestion for profiling was spot on, thanks.


Final edit: the direct runner is not intended for performance testing, only for prototyping and verifying the well-formedness of the data. It doesn't have any mechanism for splitting and dividing work by partitions, and handles everything in memory.

There are a few issues or possibilities. I will answer in priority order.

  1. The direct runner is for testing with very small data. It is engineered for maximum quality assurance, with performance not much of a priority. For example:
  • it randomly shuffles data to make sure you are not depending on ordering that will not exist in production
  • it serializes and deserializes data after each step, to make sure the data will be transmitted correctly (production runners will avoid serialization as much as possible)
  • it checks whether you have mutated elements in forbidden ways, which would cause data loss in production

The data you are describing is not very big, and the DirectRunner can eventually process it under normal circumstances.
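(For reference, the behaviours listed above are exposed as options on the direct runner. A minimal sketch, assuming the DirectOptions interface from the beam-runners-direct-java artifact; option names may vary between Beam versions.)

import org.apache.beam.runners.direct.DirectOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DirectRunnerOptionsSketch {
    public static void main(String[] args) {
        DirectOptions options = PipelineOptionsFactory.fromArgs(args).as(DirectOptions.class);
        // Number of local worker threads (typically defaults to the number of available cores).
        options.setTargetParallelism(4);
        // The quality-assurance checks described above; disabling them defeats
        // the purpose of the direct runner, so they are left on here.
        options.setEnforceImmutability(true);
        options.setEnforceEncodability(true);
        Pipeline pipeline = Pipeline.create(options);
        // ... build the pipeline as usual, then pipeline.run().waitUntilFinish();
    }
}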

  2. You have specified numShards(1), which explicitly eliminates all parallelism. It will cause all of the data to be combined and processed in a single thread, so it will be slower than it could be, even on the DirectRunner. In general, you will want to avoid artificially limiting parallelism.
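For illustration, a minimal variation of the write from the question that avoids pinning the shard count (reusing the output PCollection from the question's snippet):

// Same write as in the question, but without forcing a single shard,
// so the runner is free to parallelize the write stage.
output.apply(
    TextIO
        .write()
        .to("BIG_OUTPUT")
        .withSuffix("csv"));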

  3. If there is any out-of-memory error or other error preventing processing, you should see a log message. Otherwise, it will be helpful to look at profiling and CPU utilization to determine whether processing is active.

This question has been indirectly answered by Kenn Knowles above. The direct runner is not intended for performance testing, only for prototyping and verifying the well-formedness of the data. It doesn't have any mechanism for splitting and dividing work by partitions, and handles every dataset in memory. Performance testing should be carried out using other runners (like the Flink runner); those will provide data splitting and the kind of infrastructure needed to deal with high IO bottlenecks.
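As an illustration of switching runners, here is a minimal sketch of the same pipeline selected for the Flink runner through pipeline options. The RunOnFlink class name is a placeholder, and it assumes the beam-runners-flink artifact matching your Beam and Flink versions is on the classpath:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunOnFlink {
    public static void main(String[] args) {
        // Run with: --runner=FlinkRunner (and --flinkMaster=<host:port> for a real cluster;
        // without it, Flink runs in embedded/local mode).
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline pipeline = Pipeline.create(options);
        pipeline
            .apply(TextIO.read().from("BIG_CSV_FILE"))
            .apply(TextIO.write().to("BIG_OUTPUT").withSuffix("csv"));
        pipeline.run().waitUntilFinish();
    }
}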

UPDATE: adding to the point addressed by this question, there is a related question here: How to deal with (Apache Beam) high IO bottlenecks?

Whereas the question here revolves around figuring out whether the direct runner can deal with huge datasets (which we have already established here is not possible), the link above points to a discussion of whether production runners (like Flink/Spark/Cloud Dataflow) can natively deal with huge datasets out of the box (the short answer is yes, but please check the link yourself for a deeper discussion).
