
Apache Beam pipeline step not running in parallel? (Python)

I used a slightly modified version of the wordcount example (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py), replacing the process function with the following:

  def process(self, element):
    """Returns an iterator over the words of this element.
    The element is a line of text.  If the line is blank, note that, too.
    Args:
      element: the element being processed
    Returns:
      The processed element.
    """
    import random
    import time
    n = random.randint(0, 1000)
    time.sleep(5)
    logging.getLogger().warning('PARALLEL START? ' + str(n))
    time.sleep(5)

    text_line = element.strip()
    if not text_line:
      self.empty_line_counter.inc(1)
    words = re.findall(r'[\w\']+', text_line, re.UNICODE)
    for w in words:
      self.words_counter.inc()
      self.word_lengths_counter.inc(len(w))
      self.word_lengths_dist.update(len(w))

    time.sleep(5)
    logging.getLogger().warning('PARALLEL END? ' + str(n))
    time.sleep(5)

    return words
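
For reference, the counters that process() increments are set up in the example's WordExtractingDoFn constructor. A sketch of that context, reconstructed from the linked wordcount example (the exact metric names may differ between Beam versions):

import apache_beam as beam
from apache_beam.metrics import Metrics

class WordExtractingDoFn(beam.DoFn):
  """Parse each line of input text into words (context for the snippet above)."""

  def __init__(self):
    # Metrics referenced by the modified process() method.
    self.words_counter = Metrics.counter(self.__class__, 'words')
    self.word_lengths_counter = Metrics.counter(self.__class__, 'word_lengths')
    self.word_lengths_dist = Metrics.distribution(self.__class__, 'word_len_dist')
    self.empty_line_counter = Metrics.counter(self.__class__, 'empty_lines')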

The idea is to check whether this step is executing in parallel. For example, the expected output would be something like:

PARALLEL START? 447
PARALLEL START? 994
PARALLEL END? 447
PARALLEL START? 351
PARALLEL START? 723
PARALLEL END? 994
PARALLEL END? 351
PARALLEL END? 723

However, the actual result looks like this, which suggests the step is not running in parallel:

PARALLEL START? 447
PARALLEL END? 447
PARALLEL START? 994
PARALLEL END? 994
PARALLEL START? 351
PARALLEL END? 351
PARALLEL START? 723
PARALLEL END? 723

I have tried the (local) DirectRunner with direct_num_workers set manually, as well as the DataflowRunner with multiple workers, to no avail. What can be done to make sure this step actually runs in parallel?
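
For reference, a sketch of how I am passing the DirectRunner parallelism options (not my exact pipeline; flag names as in recent Beam releases):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The DirectRunner only fans out when direct_num_workers > 1 AND
# direct_running_mode is 'multi_threading' or 'multi_processing';
# the default 'in_memory' mode keeps everything on a single worker.
options = PipelineOptions([
    '--runner=DirectRunner',
    '--direct_num_workers=4',
    '--direct_running_mode=multi_processing',  # or 'multi_threading'
])

with beam.Pipeline(options=options) as p:
  (p
   | 'Read' >> beam.io.ReadFromText('gs://dataflow-samples/shakespeare/kinglear.txt')
   | 'ExtractWords' >> beam.ParDo(WordExtractingDoFn()))  # the DoFn shown above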

Update: the multi_processing mode found here looks promising. However, when I run it from the Windows command line (python wordcount.py --region us-east1 --setup_file setup.py --input_file gs://dataflow-samples/shakespeare/kinglear.txt --output output/), I get the following error:

Exception in thread run_worker:
Traceback (most recent call last):
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\threading.py", line 926, in _bootstrap_inner
        self.run()
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\threading.py", line 870, in run
        self._target(*self._args, **self._kwargs)
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\apache_beam\runners\portability\local_job_service.py", line 218, in run
        p = subprocess.Popen(self._worker_command_line, shell=True, env=env_dict)
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 775, in __init__
        restore_signals, start_new_session)
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 1119, in _execute_child
        args = list2cmdline(args)
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 530, in list2cmdline
        needquote = (" " in arg) or ("\t" in arg) or not arg
TypeError: argument of type 'int' is not iterable

The standard Apache Beam example uses a very small input: gs://dataflow-samples/shakespeare/kinglear.txt is only a few KB, so it does not split the work well.

Apache Beam achieves parallelism by splitting the input data. For example, if you have many files, each file will be consumed in parallel. If you have a single very large file, Beam is able to split that file into segments that will be consumed in parallel.

You are right that your code should eventually show parallelism happening - but try it with a (significantly) larger input.
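
To make that concrete, here is a sketch of the same pipeline pointed at a bigger input (the bucket path is a placeholder, and options / WordExtractingDoFn are the ones from your question):

import apache_beam as beam

# Placeholder input: many files (or one large file) give the runner
# splittable work, so bundles can actually be processed in parallel.
with beam.Pipeline(options=options) as p:
  (p
   | 'Read' >> beam.io.ReadFromText('gs://your-bucket/big-input/*.txt')
   | 'ExtractWords' >> beam.ParDo(WordExtractingDoFn())
   | 'Write' >> beam.io.WriteToText('output/words'))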
