
Apache Beam pipeline step not running in parallel? (Python)

I used a slightly modified version of the wordcount example (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py), replacing the process function with the following:

  def process(self, element):
    """Returns an iterator over the words of this element.
    The element is a line of text.  If the line is blank, note that, too.
    Args:
      element: the element being processed
    Returns:
      The processed element.
    """
    import random
    import time
    n = random.randint(0, 1000)
    time.sleep(5)
    logging.getLogger().warning('PARALLEL START? ' + str(n))
    time.sleep(5)

    text_line = element.strip()
    if not text_line:
      self.empty_line_counter.inc(1)
    words = re.findall(r'[\w\']+', text_line, re.UNICODE)
    for w in words:
      self.words_counter.inc()
      self.word_lengths_counter.inc(len(w))
      self.word_lengths_dist.update(len(w))

    time.sleep(5)
    logging.getLogger().warning('PARALLEL END? ' + str(n))
    time.sleep(5)

    return words
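
For reference, the counters that process() increments are set up in the example's WordExtractingDoFn constructor. A sketch of that context, reconstructed from the linked wordcount example (the exact metric names may differ between Beam versions):

import apache_beam as beam
from apache_beam.metrics import Metrics

class WordExtractingDoFn(beam.DoFn):
  """Parse each line of input text into words (context for the snippet above)."""

  def __init__(self):
    # Metrics referenced by the modified process() method.
    self.words_counter = Metrics.counter(self.__class__, 'words')
    self.word_lengths_counter = Metrics.counter(self.__class__, 'word_lengths')
    self.word_lengths_dist = Metrics.distribution(self.__class__, 'word_len_dist')
    self.empty_line_counter = Metrics.counter(self.__class__, 'empty_lines')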

The idea is to check whether this step is executing in parallel. For example, the expected output would be something like:

PARALLEL START? 447
PARALLEL START? 994
PARALLEL END? 447
PARALLEL START? 351
PARALLEL START? 723
PARALLEL END? 994
PARALLEL END? 351
PARALLEL END? 723

However, the actual result looks like this, which suggests the step is not running in parallel:

PARALLEL START? 447
PARALLEL END? 447
PARALLEL START? 994
PARALLEL END? 994
PARALLEL START? 351
PARALLEL END? 351
PARALLEL START? 723
PARALLEL END? 723

I have tried the (local) DirectRunner with direct_num_workers set manually, as well as the DataflowRunner with multiple workers, to no avail. What can be done to make sure this step actually runs in parallel?
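
For reference, a sketch of how I am passing the DirectRunner parallelism options (not my exact pipeline; flag names as in recent Beam releases):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The DirectRunner only fans out when direct_num_workers > 1 AND
# direct_running_mode is 'multi_threading' or 'multi_processing';
# the default 'in_memory' mode keeps everything on a single worker.
options = PipelineOptions([
    '--runner=DirectRunner',
    '--direct_num_workers=4',
    '--direct_running_mode=multi_processing',  # or 'multi_threading'
])

with beam.Pipeline(options=options) as p:
  (p
   | 'Read' >> beam.io.ReadFromText('gs://dataflow-samples/shakespeare/kinglear.txt')
   | 'ExtractWords' >> beam.ParDo(WordExtractingDoFn()))  # the DoFn shown above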

Update: the multi_processing mode found here looks promising. However, when I run it from the Windows command line (python wordcount.py --region us-east1 --setup_file setup.py --input_file gs://dataflow-samples/shakespeare/kinglear.txt --output output/), I get the following error:

Exception in thread run_worker:
Traceback (most recent call last):
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\threading.py", line 926, in _bootstrap_inner
        self.run()
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\threading.py", line 870, in run
        self._target(*self._args, **self._kwargs)
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\apache_beam\runners\portability\local_job_service.py", line 218, in run
        p = subprocess.Popen(self._worker_command_line, shell=True, env=env_dict)
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 775, in __init__
        restore_signals, start_new_session)
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 1119, in _execute_child
        args = list2cmdline(args)
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 530, in list2cmdline
        needquote = (" " in arg) or ("\t" in arg) or not arg
TypeError: argument of type 'int' is not iterable

The standard Apache Beam example uses a very small input: gs://dataflow-samples/shakespeare/kinglear.txt is only a few KB, so it does not split the work well.

Apache Beam achieves parallelism by splitting the input data. For example, if you have many files, each file will be consumed in parallel. If you have a single very large file, Beam is able to split that file into segments that will be consumed in parallel.

You are right that your code should eventually show parallelism happening - but try it with a (significantly) larger input.
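
To make that concrete, here is a sketch of the same pipeline pointed at a bigger input (the bucket path is a placeholder, and options / WordExtractingDoFn are the ones from your question):

import apache_beam as beam

# Placeholder input: many files (or one large file) give the runner
# splittable work, so bundles can actually be processed in parallel.
with beam.Pipeline(options=options) as p:
  (p
   | 'Read' >> beam.io.ReadFromText('gs://your-bucket/big-input/*.txt')
   | 'ExtractWords' >> beam.ParDo(WordExtractingDoFn())
   | 'Write' >> beam.io.WriteToText('output/words'))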
