How can I maximize throughput for an embarrassingly-parallel task in Python on Google Cloud Platform?
I am trying to use Apache Beam/Google Cloud Dataflow to speed up an existing Python application. The bottleneck of the application occurs after randomly permuting an input matrix `N` (default 125, but could be more) times, when the system runs a clustering algorithm on each matrix. The runs are fully independent of one another. I've captured the top of the pipeline below:
This processes the default 125 permutations. As you can see, only the `RunClustering` step takes an appreciable amount of time (there are 11 more steps, not shown, that total 11 more seconds). I ran the pipeline earlier today for just 1 permutation, and the `RunClustering` step took 3 seconds (close enough to 1/125th of the time shown above).
I'd like the `RunClustering` step to finish in 3-4 seconds no matter what the input `N` is. My understanding is that Dataflow is the correct tool for speeding up embarrassingly-parallel computation on Google Cloud Platform, so I've spent a couple of weeks learning it and porting my code. Is my understanding correct?
我的理解正确吗? I've also tried throwing more machines at the problem (instead of Autoscaling, which, for whatever reason, only scales up to 2-3 machines*) and specifying more powerful machine types, but those don't help.
我还尝试过解决这个问题(而不是Autoscaling,无论出于何种原因,它只能扩展到2-3台机器*),并且指定了更强大的机器类型,但这些方法无济于事。
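For reference, this is roughly how I've been overriding autoscaling and the machine type (a sketch using the standard Dataflow pipeline flags; `main.py`, the project ID, and the specific values are placeholders for my setup):

```shell
# Hypothetical launch command: forces a fixed worker pool instead of
# autoscaling, and requests a larger machine type per worker.
python main.py \
  --runner=DataflowRunner \
  --project=my-project \
  --region=us-central1 \
  --temp_location=gs://my-bucket/tmp \
  --autoscaling_algorithm=NONE \
  --num_workers=20 \
  --machine_type=n1-standard-4
```
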
*Is this because of a long startup time for VMs? Is there a way to use quickly-provisioned VMs, if that's the case?

Another question I have is how to cut down on the pipeline startup time; it's a deal-breaker if users can't get results back quickly, and the fact that the total Dataflow job time is 13–14 minutes (compared to the already excessive 6–7 for the pipeline itself) is unacceptable.
Your pipeline is suffering from excessive fusion, and ends up doing almost everything on one worker. This is also why autoscaling doesn't scale higher: it detects that it is unable to parallelize your job's code, so it prefers not to waste extra workers. This is likewise why manually throwing more workers at the problem doesn't help.
In general, fusion is a very important optimization, but excessive fusion is also a common problem that, ideally, Dataflow would be able to mitigate automatically (like it automatically mitigates imbalanced sharding). That is even harder to do, though some ideas for it are in the works.
Meanwhile, you'll need to change your code to insert a "reshuffle" (a group by key followed by an ungroup will do; fusion never happens across a group-by-key operation). See Preventing fusion; the question "Best way to prevent fusion in Google Dataflow?" contains some example code.
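The data movement a reshuffle performs can be illustrated in plain Python (a conceptual sketch only; the `reshuffle` function and `num_buckets` parameter below are illustrative, while in the Beam Python SDK the equivalent is the built-in `beam.Reshuffle()` transform):

```python
import random
from itertools import groupby

def reshuffle(elements, num_buckets=4):
    """Pair-with-random-key -> group by key -> ungroup.

    In Dataflow, the group-by-key step materializes the data, so the
    stages before and after it cannot be fused into a single stage.
    """
    # 1. Pair each element with a random key.
    keyed = [(random.randrange(num_buckets), x) for x in elements]
    # 2. Group by key (this is the fusion barrier in a real pipeline).
    keyed.sort(key=lambda kv: kv[0])
    grouped = {k: [x for _, x in g]
               for k, g in groupby(keyed, key=lambda kv: kv[0])}
    # 3. Ungroup: flatten the values back out, dropping the keys.
    return [x for group in grouped.values() for x in group]

permutations = list(range(125))
shuffled = reshuffle(permutations)
# The same elements come out (possibly reordered), but downstream
# steps now start a fresh stage that Dataflow can parallelize.
assert sorted(shuffled) == permutations
```

Inserting this barrier between the permutation-generating step and `RunClustering` lets Dataflow distribute the 125 clustering runs across workers instead of fusing them onto one.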