Apache Beam - Integration test with unbounded PCollection
We are building an integration test for an Apache Beam pipeline and are running into some issues. See below for context...
Details about our pipeline:
- PubsubIO as our data source (unbounded PCollection)
- CombineFn and a very simple windowing/triggering strategy
- JdbcIO, using org.neo4j.jdbc.Driver to write to Neo4j

Current testing approach:
- Run the pipeline by calling OurPipeline.main(TestPipeline.convertToArgs(options))
- Publish messages to Pub/Sub, which PubsubIO will read from

This is intended to be a simple integration test which will verify that our pipeline as a whole is behaving as expected.
The issue we're currently having is that when we run our pipeline it blocks. We are using DirectRunner and pipeline.run() (not pipeline.run().waitUntilFinish()), but the test seems to hang after running the pipeline. Because this is an unbounded PCollection (running in streaming mode), the pipeline does not terminate, and thus any code after it is never reached.
So, I have a few questions:
1) Is there a way to run a pipeline and then stop it manually later?
2) Is there a way to run a pipeline asynchronously? Ideally it would just kick off the pipeline (which would then continuously poll Pub/Sub for data) and then move on to the code responsible for publishing to Pub/Sub.
3) Is this method of integration testing a pipeline reasonable, or are there better methods that might be more straightforward? Any info/guidance here would be appreciated.

Let me know if I can provide any additional code/context - thanks!
You can run the pipeline asynchronously using the DirectRunner by setting the isBlockOnRun pipeline option to false. So long as you keep a reference to the returned PipelineResult available, calling cancel() on that result should stop the pipeline.
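As a rough illustration of the non-blocking run and later cancel described above, here is a minimal sketch. It assumes the Beam Java SDK and the direct runner are on the classpath. The class name AsyncDirectRunnerSketch, the method runThenCancel, and the GenerateSequence source are all stand-ins for illustration; in the real test the source would be PubsubIO and the sleep would be replaced by the publish-and-assert code.

```java
import java.io.IOException;
import org.apache.beam.runners.direct.DirectOptions;
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.joda.time.Duration;

public class AsyncDirectRunnerSketch {

  // Hypothetical helper: starts an unbounded pipeline without blocking,
  // waits for the given time, then cancels it and returns the final state.
  static PipelineResult.State runThenCancel(String[] args, long waitMillis)
      throws IOException, InterruptedException {
    DirectOptions options = PipelineOptionsFactory.fromArgs(args).as(DirectOptions.class);
    options.setRunner(DirectRunner.class);
    // The option the answer refers to: with this set to false, run() returns
    // while the pipeline keeps executing in the background.
    options.setBlockOnRun(false);

    Pipeline pipeline = Pipeline.create(options);
    // Stand-in unbounded source (assumption); the real pipeline reads from PubsubIO.
    pipeline.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)));

    // Keep the reference so the pipeline can be stopped later.
    PipelineResult result = pipeline.run();

    // Placeholder for the test body: publish messages, then assert on the sink.
    Thread.sleep(waitMillis);

    // Stop the still-running streaming pipeline.
    return result.cancel();
  }

  public static void main(String[] args) throws Exception {
    PipelineResult.State state = runThenCancel(args, 2000L);
    System.out.println("Pipeline state after cancel: " + state);
  }
}
```

In a JUnit test, the cancel() call would typically go in a finally block or an @After method so the pipeline is stopped even when an assertion fails.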
For your third question, your setup seems reasonable. However, if you want a smaller-scale test of your pipeline (one that requires fewer components), you can encapsulate all of your processing logic within a custom PTransform. This PTransform should take inputs that have already been fully parsed from the input source, and produce outputs that have not yet been converted into the form the output sink expects.
When this is done, you can use either Create (which will generally not exercise triggering) or TestStream (which may, depending on how you construct the TestStream) with the DirectRunner to generate a finite amount of input data, apply this processing PTransform to that PCollection, and use PAssert on the output PCollection to verify that the pipeline generated the outputs you expect.
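The approach above might look roughly like the following sketch. It is not the asker's pipeline: SumPerMinute is a hypothetical PTransform standing in for the real CombineFn and windowing logic, and the element values, timestamps, and one-minute fixed windows are assumptions chosen only to make the example checkable. It assumes the Beam Java SDK and the direct runner are on the classpath.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestStream;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class ProcessingTransformSketch {

  // Hypothetical processing step: everything between "parsed input" and
  // "output not yet written to the sink" lives inside one PTransform.
  static class SumPerMinute extends PTransform<PCollection<Long>, PCollection<Long>> {
    @Override
    public PCollection<Long> expand(PCollection<Long> input) {
      return input
          .apply(Window.<Long>into(FixedWindows.of(Duration.standardMinutes(1))))
          .apply(Sum.longsGlobally().withoutDefaults());
    }
  }

  public static void main(String[] args) {
    // With no runner specified, the DirectRunner is used when it is on the classpath.
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.create());

    Instant start = new Instant(0L);
    Duration oneMinute = Duration.standardMinutes(1);

    // A finite, scripted stream: two elements in the first window, one in the second.
    TestStream<Long> input =
        TestStream.create(VarLongCoder.of())
            .addElements(
                TimestampedValue.of(1L, start),
                TimestampedValue.of(2L, start.plus(Duration.standardSeconds(10))))
            .advanceWatermarkTo(start.plus(oneMinute))
            .addElements(TimestampedValue.of(5L, start.plus(Duration.standardSeconds(70))))
            .advanceWatermarkToInfinity();

    PCollection<Long> output = pipeline.apply(input).apply(new SumPerMinute());

    // Check each window's result separately.
    PAssert.that(output)
        .inWindow(new IntervalWindow(start, start.plus(oneMinute)))
        .containsInAnyOrder(3L);
    PAssert.that(output)
        .inWindow(new IntervalWindow(start.plus(oneMinute), start.plus(oneMinute).plus(oneMinute)))
        .containsInAnyOrder(5L);

    // The stream is finite, so this terminates; a failed PAssert surfaces as an error here.
    pipeline.run().waitUntilFinish();
  }
}
```

Swapping the TestStream for Create.timestamped(...) would give a simpler test of the same transform, at the cost of not controlling watermark advancement. In a JUnit test, TestPipeline used as a rule would usually replace Pipeline.create.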
For more information about testing, the Beam website covers these styles of tests in the Programming Guide and in a blog post about testing pipelines with TestStream.