Want to run Apache Beam pipelines in parallel

My problem statement is:

1. I need to fetch data from multiple third-party sources, perform some operations on it, and store the data in some location.

2. I need to create a dedicated Beam pipeline for each source.

As I am new to Beam, my questions are:

1. If I create separate pipelines for different third-party sources, is that a good design, or can it cause problems?

2. If the design is right, and I run it with beam-runners-direct-java on a single machine, will it behave like parallel processing?

Beam's long-term plan is to support many different sources (eventually even cross-language ones).

To your questions: running multiple beam-runners-direct-java pipelines in parallel on a single machine won't cause problems. In fact, all the validation tests use the direct runner, and those tests do run in parallel.
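As a rough illustration of running several direct-runner pipelines side by side in one JVM, here is a minimal sketch. It assumes `beam-sdks-java-core` and `beam-runners-direct-java` are on the classpath. The class name, the source names, and the `Create.of(...)` stand-ins for real third-party reads are hypothetical and not taken from the question. By default the direct runner blocks inside `run()`, so the sketch turns that off with `DirectOptions.setBlockOnRun(false)` and waits for all results afterwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.beam.runners.direct.DirectOptions;
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ParallelDirectPipelines {

  // Builds one small pipeline per source. Create.of(...) is a placeholder
  // (assumption) for whatever connector actually reads the third-party source.
  static Pipeline buildPipeline(String sourceName) {
    DirectOptions options = PipelineOptionsFactory.as(DirectOptions.class);
    options.setRunner(DirectRunner.class);
    // Make run() return immediately instead of blocking until completion.
    options.setBlockOnRun(false);

    Pipeline pipeline = Pipeline.create(options);
    pipeline
        .apply("Read-" + sourceName, Create.of(sourceName + "-1", sourceName + "-2"))
        .apply(
            "Process-" + sourceName,
            MapElements.into(TypeDescriptors.strings())
                .via((String element) -> element.toUpperCase()));
    return pipeline;
  }

  // Starts every pipeline first, then waits for each one to finish.
  static List<PipelineResult.State> runAll(List<String> sourceNames) {
    List<PipelineResult> running = new ArrayList<>();
    for (String name : sourceNames) {
      running.add(buildPipeline(name).run());
    }
    List<PipelineResult.State> states = new ArrayList<>();
    for (PipelineResult result : running) {
      states.add(result.waitUntilFinish());
    }
    return states;
  }

  public static void main(String[] args) {
    System.out.println(runAll(Arrays.asList("sourceA", "sourceB")));
  }
}
```

Each pipeline here is fully independent, so a failure in one does not stop the others. The cost is that the runner sees each pipeline in isolation and cannot optimize across them.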

One thing that is unclear: what is the main reason you have to create multiple pipelines, one for each third-party source? If the reason is to run things in parallel for higher throughput, I (biased opinion) think that is not a good idea. In the long run, even if we introduce a feature that optimizes parallel sources, you won't be able to benefit from that optimization.
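The answer does not spell out an alternative, but one common reading is to keep all sources in a single pipeline and let the runner parallelize the branches. The sketch below follows that reading and is an assumption, not the answerer's stated design. The class name and step names are hypothetical, and `Create.of(...)` again stands in for real source connectors. It needs `beam-sdks-java-core` and `beam-runners-direct-java` on the classpath.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MultiSourcePipeline {

  // Builds ONE pipeline with two input branches that are merged and then
  // processed by a shared step.
  static Pipeline build() {
    // With no runner set, Beam falls back to the direct runner when it is
    // available on the classpath.
    PipelineOptions options = PipelineOptionsFactory.create();
    Pipeline pipeline = Pipeline.create(options);

    // Placeholder reads (assumption): replace with the real source connectors.
    PCollection<String> fromSourceA =
        pipeline.apply("ReadSourceA", Create.of("a1", "a2"));
    PCollection<String> fromSourceB =
        pipeline.apply("ReadSourceB", Create.of("b1"));

    // Merge both branches into a single PCollection.
    PCollection<String> merged =
        PCollectionList.of(fromSourceA)
            .and(fromSourceB)
            .apply("MergeSources", Flatten.pCollections());

    // Shared processing step applied to data from every source.
    merged.apply(
        "Process",
        MapElements.into(TypeDescriptors.strings())
            .via((String element) -> element.toUpperCase()));
    return pipeline;
  }

  public static void main(String[] args) {
    build().run().waitUntilFinish();
  }
}
```

If the sources need different processing, each branch can keep its own transforms and simply skip the `Flatten` step, while still living in the same pipeline.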
