

Apache Beam pipeline step not running in parallel? (Python)

I used a slightly modified version of the wordcount example (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py), replacing the process function with the following:

  def process(self, element):
    """Returns an iterator over the words of this element.
    The element is a line of text.  If the line is blank, note that, too.
    Args:
      element: the element being processed
    Returns:
      The processed element.
    """
    import random
    import time
    # re and logging are imported at module level in wordcount.py.
    # Tag this element with a random id and sleep around the log calls so
    # that interleaved START/END lines would reveal parallel execution.
    n = random.randint(0, 1000)
    time.sleep(5)
    logging.getLogger().warning('PARALLEL START? ' + str(n))
    time.sleep(5)

    text_line = element.strip()
    if not text_line:
      self.empty_line_counter.inc(1)
    words = re.findall(r'[\w\']+', text_line, re.UNICODE)
    for w in words:
      self.words_counter.inc()
      self.word_lengths_counter.inc(len(w))
      self.word_lengths_dist.update(len(w))

    time.sleep(5)
    logging.getLogger().warning('PARALLEL END? ' + str(n))
    time.sleep(5)

    return words
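
For context, in the wordcount example this process method belongs to a DoFn that the pipeline applies with beam.ParDo, so each worker calls it once per input line. Below is a trimmed sketch of that wiring; the counter names follow the example, and the probe body shown above is elided:

import re
import logging

import apache_beam as beam
from apache_beam.metrics import Metrics


class WordExtractingDoFn(beam.DoFn):
  """Parses each line of input text into words, as in the wordcount example."""

  def __init__(self):
    # Metric counters referenced by the probe version of process() above.
    self.words_counter = Metrics.counter(self.__class__, 'words')
    self.word_lengths_counter = Metrics.counter(self.__class__, 'word_lengths')
    self.word_lengths_dist = Metrics.distribution(self.__class__, 'word_len_dist')
    self.empty_line_counter = Metrics.counter(self.__class__, 'empty_lines')

  def process(self, element):
    ...  # probe version shown above


# In the pipeline, the DoFn is applied with ParDo:
#   lines | 'Split' >> beam.ParDo(WordExtractingDoFn())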

The idea is to check that the step is being performed in parallel. The expected output would be, for instance:

PARALLEL START? 447
PARALLEL START? 994
PARALLEL END? 447
PARALLEL START? 351
PARALLEL START? 723
PARALLEL END? 994
PARALLEL END? 351
PARALLEL END? 723

However, the actual result is something like this, which indicates that the step is not running in parallel:

PARALLEL START? 447
PARALLEL END? 447
PARALLEL START? 994
PARALLEL END? 994
PARALLEL START? 351
PARALLEL END? 351
PARALLEL START? 723
PARALLEL END? 723

I've tried using the local DirectRunner with direct_num_workers set manually, as well as using the DataflowRunner with multiple workers, to no avail. What can be done to ensure that this step actually runs in parallel?
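
For reference, here is a minimal sketch of how these direct runner settings are typically passed; the option names direct_num_workers and direct_running_mode assume a recent Beam SDK (the multi_processing value corresponds to the multi-processing mode mentioned in the update below):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Assumed DirectRunner options in a recent Beam SDK; adjust to your version.
options = PipelineOptions([
    '--runner=DirectRunner',
    '--direct_num_workers=4',
    '--direct_running_mode=multi_processing',  # or 'multi_threading'
])

with beam.Pipeline(options=options) as p:
  (p
   | 'Create' >> beam.Create(['a quick brown fox', '', 'jumps over the lazy dog'])
   | 'Split' >> beam.FlatMap(lambda line: line.split())
   | 'Print' >> beam.Map(print))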

Update: the multi-processing mode found here looks promising. However, when using it from the Windows command line (python wordcount.py --region us-east1 --setup_file setup.py --input_file gs://dataflow-samples/shakespeare/kinglear.txt --output output/), I receive the following error:

Exception in thread run_worker:
Traceback (most recent call last):
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\threading.py", line 926, in _bootstrap_inner
        self.run()
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\threading.py", line 870, in run
        self._target(*self._args, **self._kwargs)
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\apache_beam\runners\portability\local_job_service.py", line 218, in run
        p = subprocess.Popen(self._worker_command_line, shell=True, env=env_dict)
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 775, in __init__
        restore_signals, start_new_session)
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 1119, in _execute_child
        args = list2cmdline(args)
    File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 530, in list2cmdline
        needquote = (" " in arg) or ("\t" in arg) or not arg
TypeError: argument of type 'int' is not iterable

The standard Apache Beam example uses a very small data input: gs://dataflow-samples/shakespeare/kinglear.txt is only a few KB, so the work will not split well.

Apache Beam parallelizes work by splitting up the input data. For example, if you have many files, each file will be consumed in parallel. If you have a single file that is very large, Beam is able to split that file into segments that will be consumed in parallel.
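
As an illustration, here is a small sketch of a read that gives Beam something to split: a glob of several (ideally larger) text files. The bucket path is a placeholder, not a real sample location:

import apache_beam as beam

with beam.Pipeline() as p:
  (p
   # A glob matches many files; each file (and splittable ranges inside a
   # large file) becomes a separate source split that workers can consume
   # in parallel. Replace the placeholder path with your own larger dataset.
   | 'Read' >> beam.io.ReadFromText('gs://your-bucket/corpus/*.txt')
   | 'Split' >> beam.FlatMap(lambda line: line.split())
   | 'Count' >> beam.combiners.Count.PerElement()
   | 'Write' >> beam.io.WriteToText('output/counts'))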

You are correct that your code should eventually show parallelism happening, but try it with a (significantly) larger input.
