
How to transform bounded pcollection to unbounded in Python with Apache Beam?

I'm trying to transform a few terabytes of mail logs stored in GCS without using too much memory.

As recommended in the guide, I add a timestamp to each element, window the collection into sliding windows, and specify an (aggregation) trigger before sending it to a GroupByKey, which is followed by a ParDo parser. This should be enough, yet the GroupByKey still waits for all the data to arrive. Why?

I have tried it with both the Direct and the Google Dataflow runner.

What am I missing?

Here's the gist of the code:

    import sys

    import apache_beam as beam
    from apache_beam.transforms import trigger

    # parse_args, json_loads, log_processed_rows, output_format,
    # AddTimestampDoFn and ParseSendingsDoFn are helpers defined elsewhere.
    known_args, pipeline_options = parse_args(sys.argv)
    pipeline = beam.Pipeline(options=pipeline_options)
    lines = pipeline | 'read' >> beam.io.ReadFromText(known_args.input)
    parsed_json = lines \
      | 'cut' >> beam.Map(lambda x: x[39:])\
      | 'jsonify' >> beam.Map(json_loads).with_outputs('invalid_input', main='main')

    keyed_lines = parsed_json['main']\
      | 'key_filter' >> beam.Filter(lambda x: u'key' in x)\
      | 'keyer' >> beam.Map(lambda x: (x['key'], x))\
      | 'log_processed_rows' >> beam.Map(log_processed_rows)

    window_trigger = trigger.DefaultTrigger()

    windowed_lines = keyed_lines\
       | 'timestamp' >> beam.ParDo(AddTimestampDoFn())\
       | 'window' >> beam.WindowInto(
           beam.window.SlidingWindows(size=3600 + 600, period=3600),  # 1-hour steps with 10 minutes of overlap
           trigger=window_trigger,
           accumulation_mode=trigger.AccumulationMode.DISCARDING)

    results = windowed_lines \
        | 'groupbykey' >> beam.GroupByKey()\
        | 'parse' >> beam.ParDo(ParseSendingsDoFn()).with_outputs('too_few_rows', 'invalid_rows', 'missing_recipients', main='main_valid')  

    output = results['main_valid'] \
        | 'format' >> beam.Map(output_format)\
        | 'write' >> beam.io.WriteToText(known_args.output, file_name_suffix=".gz")

    pipeline.run()
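
The AddTimestampDoFn used above is not shown in the gist; following the pattern from the Beam programming guide, the timestamp step presumably looks something like this (the 'timestamp' field name is a placeholder for whatever event-time field the log rows actually carry):

    class AddTimestampDoFn(beam.DoFn):
        def process(self, element):
            # element is the (key, row) pair produced by the 'keyer' step;
            # 'timestamp' is a placeholder for the row's event-time field
            # (assumed to be Unix seconds).
            _, row = element
            yield beam.window.TimestampedValue(element, row['timestamp'])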

The default trigger requires the entire data set to be available before processing, which is why it waits for all data to arrive.

Try using a data-driven trigger or a processing-time trigger.
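
For example, a processing-time or element-count trigger could replace the DefaultTrigger in your windowing step. This is only a sketch; the delay and count values are arbitrary:

    # Fire 60 seconds (processing time) after the first element arrives in a
    # pane, then reset and repeat; values here are illustrative only.
    window_trigger = trigger.Repeatedly(trigger.AfterProcessingTime(delay=60))
    # ...or, for a data-driven variant, fire after every 1000 buffered elements:
    # window_trigger = trigger.Repeatedly(trigger.AfterCount(1000))

    windowed_lines = keyed_lines \
        | 'timestamp' >> beam.ParDo(AddTimestampDoFn()) \
        | 'window' >> beam.WindowInto(
            beam.window.SlidingWindows(size=3600 + 600, period=3600),
            trigger=window_trigger,
            accumulation_mode=trigger.AccumulationMode.DISCARDING)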

Furthermore, SlidingWindows is mostly used for running averages, and since you are only adding a timestamp, a single global window might be a better option.
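
A sketch of that alternative, keeping a repeated data-driven trigger but using a single global window (the element count is arbitrary):

    windowed_lines = keyed_lines \
        | 'window' >> beam.WindowInto(
            beam.window.GlobalWindows(),
            trigger=trigger.Repeatedly(trigger.AfterCount(10000)),
            accumulation_mode=trigger.AccumulationMode.DISCARDING)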

EDIT

Regarding the windowing, you can have a streaming job that creates a PCollection from in-memory data and takes a side input from GCS storage.
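
A rough sketch of that idea (the GCS path and element values are placeholders; AsList is one of Beam's standard side-input views, and the in-memory Create stands in for a real streaming source):

    from apache_beam.options.pipeline_options import StandardOptions

    pipeline_options.view_as(StandardOptions).streaming = True
    pipeline = beam.Pipeline(options=pipeline_options)

    # Main PCollection from in-memory data (or a real streaming source).
    main_input = pipeline | 'create' >> beam.Create(['message-1', 'message-2'])

    # Bounded GCS data, materialized once and handed to every element
    # as a side input ('gs://my-bucket/mail-logs/*' is a hypothetical path).
    gcs_rows = pipeline | 'read_gcs' >> beam.io.ReadFromText('gs://my-bucket/mail-logs/*')

    enriched = main_input | 'enrich' >> beam.Map(
        lambda element, rows: (element, len(rows)),  # combine however is needed
        rows=beam.pvalue.AsList(gcs_rows))

    pipeline.run()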
