
Apache Beam pipeline step not running in parallel? (Python)

I used a slightly modified version of the wordcount example (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py), replacing the process function with the following:

  def process(self, element):
    """Returns an iterator over the words of this element.
    The element is a line of text.  If the line is blank, note that, too.
    Args:
      element: the element being processed
    Returns:
      The processed element.
    """
    import random
    import time
    n = random.randint(0, 1000)
    time.sleep(5)
    logging.getLogger().warning('PARALLEL START? ' + str(n))
    time.sleep(5)

    text_line = element.strip()
    if not text_line:
      self.empty_line_counter.inc(1)
    words = re.findall(r'[\w\']+', text_line, re.UNICODE)
    for w in words:
      self.words_counter.inc()
      self.word_lengths_counter.inc(len(w))
      self.word_lengths_dist.update(len(w))

    time.sleep(5)
    logging.getLogger().warning('PARALLEL END? ' + str(n))
    time.sleep(5)

    return words
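
For context, the self.*_counter attributes used above come from the surrounding DoFn in the linked wordcount example; a rough sketch of that class (based on the example, not part of my modification) looks like this:

import apache_beam as beam
from apache_beam.metrics import Metrics

class WordExtractingDoFn(beam.DoFn):
  """Parses each line of input text into words (sketch of the wordcount example)."""

  def __init__(self):
    # Metric counters referenced by process() above.
    self.words_counter = Metrics.counter(self.__class__, 'words')
    self.word_lengths_counter = Metrics.counter(self.__class__, 'word_lengths')
    self.word_lengths_dist = Metrics.distribution(self.__class__, 'word_len_dist')
    self.empty_line_counter = Metrics.counter(self.__class__, 'empty_lines')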

The idea is to check whether this step is being executed in parallel. For example, the expected output would be something like:

PARALLEL START? 447
PARALLEL START? 994
PARALLEL END? 447
PARALLEL START? 351
PARALLEL START? 723
PARALLEL END? 994
PARALLEL END? 351
PARALLEL END? 723

However, the actual result looks like this, which suggests the step is not running in parallel:

PARALLEL START? 447
PARALLEL END? 447
PARALLEL START? 994
PARALLEL END? 994
PARALLEL START? 351
PARALLEL END? 351
PARALLEL START? 723
PARALLEL END? 723

I tried the LocalRunner with direct_num_workers set manually, as well as the DataflowRunner with multiple workers, to no avail. What can be done to make sure this step actually runs in parallel?
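
For reference, in recent Beam releases the DirectRunner exposes direct_num_workers and direct_running_mode as pipeline options; a minimal sketch of setting them (the input/output paths and worker count below are placeholders) would be:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Ask the DirectRunner for several local worker processes so that bundles
# from different input splits can run at the same time.
options = PipelineOptions([
    '--runner=DirectRunner',
    '--direct_num_workers=4',
    '--direct_running_mode=multi_processing',  # or 'multi_threading'
])

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('input.txt')        # placeholder input
     | 'Split' >> beam.FlatMap(lambda line: line.split())
     | 'Count' >> beam.combiners.Count.PerElement()
     | 'Write' >> beam.io.WriteToText('counts'))          # placeholder output prefix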

Update: the multi-processing mode found here looks promising. However, when using it from the Windows command line (python wordcount.py --region us-east1 --setup_file setup.py --input_file gs://dataflow-samples/shakespeare/kinglear.txt --output output/), I get the following error:

Exception in thread run_worker:
Traceback (most recent call last):
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\threading.py", line 926, in _bootstrap_inner
        self.run()
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\threading.py", line 870, in run
        self._target(*self._args, **self._kwargs)
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\apache_beam\runners\portability\local_job_service.py", line 218, in run
        p = subprocess.Popen(self._worker_command_line, shell=True, env=env_dict)
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 775, in __init__
        restore_signals, start_new_session)
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 1119, in _execute_child
        args = list2cmdline(args)
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 530, in list2cmdline
        needquote = (" " in arg) or ("\t" in arg) or not arg
TypeError: argument of type 'int' is not iterable

The standard Apache Beam example uses a very small input: gs://dataflow-samples/shakespeare/kinglear.txt is only a few KB, so it does not split the work well.

Apache Beam achieves parallelism by splitting the input data. For example, if you have many files, each file will be consumed in parallel. If you have one very large file, Beam is able to split that file into segments that will be consumed in parallel.

You are right that your code should eventually show parallelism happening, but try it with a (significantly) larger input.
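
As a concrete illustration, pointing the read at a glob of files (or at one much larger file) gives the runner splits it can hand to different workers. The wildcard path below assumes the sample bucket contains more than one text file; substitute any suitably large input of your own:

import apache_beam as beam

with beam.Pipeline() as p:
    # A glob of files yields many sources, which Beam can consume in parallel;
    # a single large file can likewise be split into parallel segments.
    lines = p | 'Read' >> beam.io.ReadFromText(
        'gs://dataflow-samples/shakespeare/*.txt')
    counts = (lines
              | 'Split' >> beam.FlatMap(lambda line: line.split())
              | 'Count' >> beam.combiners.Count.PerElement())
    counts | 'Write' >> beam.io.WriteToText('counts')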
