Can we dynamically create Apache Beam Dataflow pipelines using a for loop?

Question

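Can we dynamically create Apache Beam Dataflow pipelines using a for loop? I am concerned about how a for loop behaves in a distributed environment when it is used with the Dataflow runner. I believe this works with the DirectRunner.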

For example, can I create pipelines dynamically like this:

    with beam.Pipeline(options=pipeline_options) as pipeline:
        for p in cdata['tablelist']:
            i_file_path = p['sourcefile']
            schemauri = p['schemauri']
            schema = getschema(schemauri)
            dest_table_id = p['targettable']
            (
                pipeline
                | "Read From Input Datafile" + dest_table_id >> beam.io.ReadFromText(i_file_path)
                | "Convert to Dict" + dest_table_id >> beam.Map(lambda r: data_ingestion.parse_method(r))
                | "Write to BigQuery Table" + dest_table_id >> beam.io.WriteToBigQuery(
                    '{0}:{1}'.format(project_name, dest_table_id),
                    schema=schema,
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
            )

Answer 1

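Yes, this is perfectly legal, and many pipelines (especially ML pipelines) are built this way. Your loop-based pipeline construction above should work on all runners.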

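You can think of a Beam pipeline as having two phases: construction and execution. The first phase, construction, happens entirely in the main program and can contain arbitrary loops, control statements, and so on. Behind the scenes, this builds up a DAG of deferred operations (such as reads, maps, etc.) to perform. If you have a loop, each iteration simply appends more operations to this graph. The only thing you cannot do in this phase is inspect the data (i.e. the contents of the PCollections) itself.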

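The second phase, execution, starts when pipeline.run() is invoked. (In Python, this is called implicitly when the with block exits.) At this point the pipeline graph (as constructed above), its dependencies, the pipeline options, etc. are passed to the Runner, which actually executes the fully specified graph, ideally in parallel.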
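The two phases can be illustrated with a toy model in plain Python. This is not Beam's API: ToyPipeline, apply, and the table names are hypothetical, invented only for illustration. The sketch shows that a loop at construction time does nothing except add nodes to a graph, and that no data is touched until run() is called.

```python
class ToyPipeline:
    """Hypothetical stand-in for a pipeline: records steps now, runs them later."""

    def __init__(self):
        self.graph = []     # (label, function, source) tuples recorded at construction time
        self.executed = []  # labels of the steps that have actually run

    def apply(self, label, fn, source):
        # Reject duplicate labels so that every step in the graph is uniquely named.
        if any(existing == label for existing, _, _ in self.graph):
            raise ValueError(f"duplicate label: {label}")
        self.graph.append((label, fn, source))

    def run(self):
        # Execution phase: only now is any data processed.
        results = {}
        for label, fn, source in self.graph:
            results[label] = [fn(x) for x in source]
            self.executed.append(label)
        return results


pipeline = ToyPipeline()

# Construction phase: an ordinary loop adds one branch per (made-up) table.
for table in ["orders", "users"]:
    pipeline.apply(f"Upper {table}", str.upper, [table, table + "_v2"])

steps_before_run = list(pipeline.executed)  # still empty: nothing has run yet
results = pipeline.run()
```

The label check mirrors why the question's code appends dest_table_id to each step name: Beam expects transform labels to be unique within a pipeline, so a loop that reused the same label on every iteration would fail at construction time.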

This is covered to some extent in the programming guide, though I agree it could be clearer.

Answer 2

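I don't think this is possible.

There are several other ways to achieve this.

If you have an orchestrator such as Cloud Composer/Airflow or Cloud Workflows, you can put this logic in the orchestrator and, in a loop, instantiate and launch one Dataflow job per element:

Solution 1, with Airflow as an example: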

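from airflow.operators.dummy import DummyOperator
from airflow.providers.google.cloud.operators.dataflow import DataflowCreatePythonJobOperator

# Create the final task once: task_ids must be unique within a DAG,
# so it cannot be re-created on every loop iteration.
end = DummyOperator(task_id='END', dag=dag)

for p in cdata['tablelist']:
    i_file_path = p['sourcefile']
    schemauri = p['schemauri']
    dest_table_id = p['targettable']

    # Options passed to the Beam program for this table
    options = {
        'i_file_path': i_file_path,
        'dest_table_id': dest_table_id,
        'schemauri': schemauri,
        ...
    }

    dataflow_task = DataflowCreatePythonJobOperator(
        py_file=beam_main_file_path,
        task_id=f'task_{dest_table_id}',
        dataflow_default_options=your_default_options,
        options=options,
        gcp_conn_id="google_cloud_default"
    )

    # The job tasks do not depend on each other, so the Dataflow jobs can run in parallel
    dataflow_task >> end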

Solution 2, with a shell script:

for module_path in ${tablelist}; do
   # Options (shell assignments must not have spaces around "=")
   i_file_path=...
   schemauri=...
   dest_table_id=...

   # Python command that launches the Dataflow job
   python -m your_module.main \
        --runner=DataflowRunner \
        --staging_location=gs://your_staging_location/ \
        --temp_location=gs://your_temp_location/ \
        --region=europe-west1 \
        --setup_file=./setup.py \
        --i_file_path="$i_file_path" \
        --schemauri="$schemauri" \
        --dest_table_id="$dest_table_id"
done

In this case, the Dataflow jobs are executed sequentially.
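The same sequential launch could also be driven from Python instead of a shell loop. The sketch below is only an illustration: build_launch_command and launch_all are hypothetical helpers, the flags are copied from the shell script above, and the dictionary keys follow the tablelist entries from the question. With dry_run left at True nothing is launched and the commands are only returned.

```python
import subprocess
import sys


def build_launch_command(entry):
    """Build the argument list for one job from one tablelist entry (hypothetical helper)."""
    return [
        sys.executable, "-m", "your_module.main",
        "--runner=DataflowRunner",
        "--staging_location=gs://your_staging_location/",
        "--temp_location=gs://your_temp_location/",
        "--region=europe-west1",
        "--setup_file=./setup.py",
        f"--i_file_path={entry['sourcefile']}",
        f"--schemauri={entry['schemauri']}",
        f"--dest_table_id={entry['targettable']}",
    ]


def launch_all(tablelist, dry_run=True):
    """Return one command per entry; when dry_run is False, run them one after another."""
    commands = [build_launch_command(entry) for entry in tablelist]
    if not dry_run:
        for command in commands:
            # subprocess.run blocks until the launcher process exits,
            # so the commands are started strictly in sequence.
            subprocess.run(command, check=True)
    return commands
```

Passing the arguments as a list avoids shell quoting problems when a path or URI contains spaces or special characters.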

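If there are too many files and Dataflow jobs to launch, you can consider another solution. With a shell script or a Cloud Function, you can retrieve all the needed files, rename them as expected (with the metadata in the filename), and move them to a separate location in GCS.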

Then, in a single Dataflow job:

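Read all of those files with a file pattern
Parse the metadata, such as schemauri and dest_table_id, from the filename
Apply the map operation in the job to the current element
Write the result to BigQuery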
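The filename-parsing step might look like the sketch below. The naming convention (destination table, schema name, and original name joined by double underscores) and the helpers parse_file_metadata and tag_line are assumptions made for illustration; the answer does not say how the metadata is encoded in the filename. In a Beam job, a function like tag_line could be mapped over (filename, line) pairs, which a read transform that keeps the filename, such as ReadFromTextWithFilename, can produce.

```python
import posixpath


def parse_file_metadata(path):
    """Extract metadata from a path such as
    gs://bucket/incoming/<dest_table_id>__<schema_name>__<original_name>.

    The double-underscore convention is hypothetical, chosen for this sketch.
    """
    filename = posixpath.basename(path)
    # Split on the first two separators only; the original name may contain more.
    dest_table_id, schema_name, original_name = filename.split("__", 2)
    return {
        "dest_table_id": dest_table_id,
        "schema_name": schema_name,
        "original_name": original_name,
    }


def tag_line(element):
    """Turn a (filename, line) pair into (dest_table_id, line)."""
    filename, line = element
    return parse_file_metadata(filename)["dest_table_id"], line


example = parse_file_metadata(
    "gs://my-bucket/incoming/customers__customers_schema__2022-09-11.csv"
)
```

A filename that does not follow the convention raises ValueError during unpacking, which makes malformed inputs visible early instead of silently routing rows to the wrong table.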

If you do not have a large number of files, the first two solutions are simpler.

2 answers

Answer 1
1 2022-09-12 17:24:19

Answer 2
0 2022-09-11 21:54:58
