How do I profile a Python Dataflow job?

Question

I wrote a Python Dataflow job to process some data:

import json

import apache_beam as beam

# The trailing comment on each step is the time noted for that step.
(
    pipeline
    | "read" >> beam.io.ReadFromText(known_args.input)  # 9 min 44 sec
    | "parse_line" >> beam.Map(parse_line)  # 4 min 55 sec
    | "add_key" >> beam.Map(add_key)  # 48 sec
    | "group_by_key" >> beam.GroupByKey()  # 11 min 56 sec
    | "map_values" >> beam.ParDo(MapValuesFn())  # 11 min 40 sec
    | "json_encode" >> beam.Map(json.dumps)  # 26 sec
    | "output" >> beam.io.textio.WriteToText(known_args.output)  # 22 sec
)

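(I have removed the business-specific names.)

The input is a 1.36 GiB gzip-compressed CSV, but the job takes 37 min 34 sec to run. (I am using Dataflow because I expect the input size to grow quickly.)

How can I identify the bottleneck in the pipeline and speed up its execution? None of the individual functions is computationally expensive.

Autoscaling information from the Dataflow console: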

12:00:35 PM     Starting a pool of 1 workers. 
12:05:02 PM     Autoscaling: Raised the number of workers to 2 based on the rate of progress in the currently running step(s).
12:10:02 PM     Autoscaling: Reduced the number of workers to 1 based on the rate of progress in the currently running step(s).
12:29:09 PM     Autoscaling: Raised the number of workers to 3 based on the rate of progress in the currently running step(s).
12:35:10 PM     Stopping worker pool.

Answer 1

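I searched dev@beam.apache.org and found a thread that discusses this topic: https : dev@beam.apache.org

If you need to, you can read that thread for useful information, or use it to ask questions, make requests, or join the discussion.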

Answer 2

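By accident, I found that the problem in this case was the compression of the CSV.

The input was a single gzip-compressed CSV. To make the data easier to inspect, I switched to an uncompressed CSV. That cut the processing time to under 17 minutes, and Dataflow's autoscaling peaked at 10 workers.

(If I still needed compression, I would split the CSV into several parts and compress each part separately.)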
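A likely explanation is that a gzip stream cannot be split, so a single worker has to read the whole file from start to finish. The split-then-compress idea could be sketched as below. This is only an illustration using the standard library: the function name `split_gzip_lines`, the shard naming scheme, and the default shard size are all assumptions, and it assumes one record per line with no header row (the same way `ReadFromText` treats the file).

```python
import gzip
import itertools


def split_gzip_lines(src_path, dst_prefix, lines_per_shard=1_000_000):
    """Split one gzip-compressed text file into several gzip-compressed shards.

    Hypothetical helper: assumes one record per line and no header row.
    Returns the list of shard paths that were written.
    """
    shard_paths = []
    # newline="" keeps the original line endings untouched on read and write.
    with gzip.open(src_path, "rt", encoding="utf-8", newline="") as src:
        for index in itertools.count():
            first = next(src, None)
            if first is None:
                break  # input exhausted, so no empty shard is created
            path = f"{dst_prefix}-{index:05d}.csv.gz"
            with gzip.open(path, "wt", encoding="utf-8", newline="") as dst:
                dst.write(first)
                # Stream the rest of this shard without loading it in memory.
                dst.writelines(itertools.islice(src, lines_per_shard - 1))
            shard_paths.append(path)
    return shard_paths
```

After uploading the shards, a glob such as `gs://my-bucket/input-*.csv.gz` (a made-up bucket and prefix) can be passed to `beam.io.ReadFromText`, which accepts file patterns and by default detects gzip from the file extension, so the shards can be read by different workers.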

Answer 3

I came across this Python profiler package from Google: https://cloud.google.com/profiler/docs/profiling-python
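Cloud Profiler attaches to running workers. Before setting that up, it may be worth checking locally whether a per-element function really is cheap. The sketch below uses the standard library's `cProfile` rather than the Google package, so it is a different technique from the one linked above. The body of `parse_line` here is a hypothetical stand-in, since the question does not show the real one, and `profile_function` is a name invented for this example.

```python
import cProfile
import csv
import io
import pstats


def parse_line(line):
    # Hypothetical stand-in for the question's parse_line: parse one CSV row.
    return next(csv.reader([line]))


def profile_function(fn, inputs, top=5):
    """Run fn over every input under cProfile and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    for item in inputs:
        fn(item)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call trees come first.
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [f"a,b,{i}" for i in range(10_000)]
    print(profile_function(parse_line, sample))
```

Running this on a sample of real input lines gives a per-function breakdown. If nothing stands out, the time is more likely going to I/O or to how the work is parallelised than to the functions themselves.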

Answer 1
0 2019-07-08 17:26:52

Answer 2
0 2019-07-16 15:52:07

Answer 3
0 2021-05-20 02:45:36