
Apache Beam - Integration test with unbounded PCollection

We are building an integration test for an Apache Beam pipeline and are running into some issues. See below for context...

Details about our pipeline:

  • We use PubsubIO as our data source (unbounded PCollection)
  • Intermediate transforms include a custom CombineFn and a very simple windowing/triggering strategy
  • Our final transform is JdbcIO, using org.neo4j.jdbc.Driver to write to Neo4j

Current testing approach:

  • Run Google Cloud's Pub/Sub emulator on the machine that the tests are running on
  • Build an in-memory Neo4j database and pass its URI into our pipeline options
  • Run the pipeline by calling OurPipeline.main(TestPipeline.convertToArgs(options))
  • Use Google Cloud's Java Pub/Sub client library to publish messages to a test topic (via the Pub/Sub emulator), which PubsubIO will read from
  • Data should flow through the pipeline and eventually hit our in-memory instance of Neo4j
  • Make simple assertions regarding the presence of this data in Neo4j

This is intended to be a simple integration test which will verify that our pipeline as a whole is behaving as expected.

The issue we're currently having is that when we run our pipeline, it blocks. We are using DirectRunner and pipeline.run() (not pipeline.run().waitUntilFinish()), but the test seems to hang after running the pipeline. Because this is an unbounded PCollection (running in streaming mode), the pipeline does not terminate, and thus any code after it is never reached.

So, I have a few questions:

1) Is there a way to run a pipeline and then stop it manually later?

2) Is there a way to run a pipeline asynchronously? Ideally it would just kick off the pipeline (which would then continuously poll Pub/Sub for data) and then move on to the code responsible for publishing to Pub/Sub.

3) Is this method of integration-testing a pipeline reasonable, or are there better methods that might be more straightforward? Any info/guidance here would be appreciated.

Let me know if I can provide any additional code/context - thanks!

You can run the pipeline asynchronously using the DirectRunner by setting the isBlockOnRun pipeline option to false. As long as you keep a reference to the returned PipelineResult, calling cancel() on that result should stop the pipeline.
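A minimal sketch of that approach (the class name, the comments, and the polling step are illustrative; the option lives on DirectOptions, so the test must depend on the DirectRunner module):

```java
import org.apache.beam.runners.direct.DirectOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class AsyncRunSketch {
  public static void main(String[] args) throws Exception {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    // Tell the DirectRunner not to block inside run(); run() then returns
    // immediately while the streaming pipeline keeps executing in the background.
    options.as(DirectOptions.class).setBlockOnRun(false);

    Pipeline pipeline = Pipeline.create(options);
    // ... apply your PubsubIO -> CombineFn -> JdbcIO transforms here ...

    PipelineResult result = pipeline.run(); // returns without waiting

    // At this point the test can publish messages to the Pub/Sub emulator,
    // then poll Neo4j (with a timeout) for the expected rows.

    result.cancel(); // manually stop the unbounded pipeline when done
  }
}
```

Keeping the PipelineResult in a test field (rather than a local in main) makes it easy to cancel from an @After method if an assertion fails mid-test.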

For your third question, your setup seems reasonable. However, if you want a smaller-scale test of your pipeline (requiring fewer components), you can encapsulate all of your processing logic within a custom PTransform. This PTransform should take inputs that have been fully parsed from the input source, and produce outputs that have not yet been serialized for the output sink.

When this is done, you can use either Create (which will generally not exercise triggering) or TestStream (which may, depending on how you construct the TestStream) with the DirectRunner to generate a finite amount of input data, apply this processing PTransform to that PCollection, and use PAssert on the output PCollection to verify that the pipeline generated the outputs you expect.
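For illustration, a sketch of that pattern (the string inputs, one-minute window, and the Count transform standing in for your custom CombineFn are all placeholders):

```java
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.testing.TestStream;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;
import org.joda.time.Instant;
import org.junit.Rule;
import org.junit.Test;

public class ProcessingTransformTest {
  @Rule
  public final transient TestPipeline pipeline = TestPipeline.create();

  @Test
  public void producesExpectedOutput() {
    // A deterministic, finite "stream" of already-parsed inputs,
    // with explicit control over watermark advancement.
    TestStream<String> input =
        TestStream.create(StringUtf8Coder.of())
            .addElements("a", "a", "b")
            .advanceWatermarkTo(new Instant(0).plus(Duration.standardMinutes(1)))
            .advanceWatermarkToInfinity();

    PCollection<Long> output =
        pipeline
            .apply(input)
            .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
            // Stand-in for your custom processing PTransform / CombineFn.
            .apply(Count.globally().withoutDefaults());

    PAssert.that(output).containsInAnyOrder(3L);
    pipeline.run().waitUntilFinish();
  }
}
```

Because TestStream controls the watermark explicitly, this style of test can also exercise late data and multiple trigger firings, which is hard to do reliably against a live Pub/Sub emulator.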

For more information about testing, the Beam website has information about these styles of tests in the Programming Guide and a blog post about testing pipelines with TestStream.

