Max and min of several fields inside a PCollection in Apache Beam with Python

Question

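I am using Apache Beam through the Python SDK and have the following problem:

I have a PCollection with approximately 1 million entries. Each entry looks like a list of 150 2-tuples, [(key1,value1),(key2,value2),...]. I need to find the max and min value for each key across all entries of the PCollection in order to normalize the values.

Ideally, I would get a PCollection with a list of tuples [(key,max_value,min_value),...], from which it would be easy to normalize and obtain [(key1,norm_value1),(key2,norm_value2),...], where norm_value = (value - min) / (max - min)

At the moment I can only do this separately for each key by hand, which is neither convenient nor sustainable, so any suggestions would be helpful.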

Answer 1

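I decided to use a custom CombineFn to determine the minimum and maximum per key, then join them with the input data using CoGroupByKey and apply the desired mapping to normalize the values.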
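How a min/max combiner of this kind is evaluated can be sketched without Beam. The plain functions below mirror the four CombineFn steps; the split of one key's values into two bundles is a made-up illustration of how a runner might distribute the work, not something Beam guarantees.

```python
from functools import reduce


def create_accumulator():
    # Empty range: any real value replaces both bounds.
    return (float("inf"), float("-inf"))


def add_input(acc, value):
    # Fold one value into a partial (min, max) pair.
    return (min(acc[0], value), max(acc[1], value))


def merge_accumulators(accs):
    # Collapse several partial (min, max) pairs into a single pair.
    merged = create_accumulator()
    for lo, hi in accs:
        merged = (min(merged[0], lo), max(merged[1], hi))
    return merged


# Hypothetical split of the values of one key across two workers.
bundles = [[5, 9], [2]]
partials = [reduce(add_input, bundle, create_accumulator()) for bundle in bundles]
result = merge_accumulators(partials)
print(partials, result)
```

The point of the sketch is that the merge step has to return one accumulator of the same shape as the inputs, because the runner may merge partial results more than once.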

"""Normalize PCollection values."""

import argparse
import logging

import apache_beam as beam
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions


# custom CombineFn that outputs the (min, max) of the values for each key
class MinMaxFn(beam.CombineFn):
  # start from an empty range so that any real value replaces both bounds
  def create_accumulator(self):
    return (float('inf'), float('-inf'))

  # update if current value is a new min or max
  def add_input(self, min_max, input):
    (current_min, current_max) = min_max
    return min(current_min, input), max(current_max, input)

  # collapse the partial (min, max) pairs into a single pair
  def merge_accumulators(self, accumulators):
    merged = self.create_accumulator()
    for (acc_min, acc_max) in accumulators:
      merged = (min(merged[0], acc_min), max(merged[1], acc_max))
    return merged

  def extract_output(self, min_max):
    return min_max


# normalize the values of one key with 'norm_value = (value - min) / (max - min)'
def normalize(element):
  key, (values, min_max) = element
  # CombinePerKey emits exactly one (min, max) pair per key
  (min_value, max_value) = list(min_max)[0]
  return key, [float(val - min_value) / (max_value - min_value) for val in values]


def run(argv=None):
  """Main entry point; defines and runs the pipeline."""
  parser = argparse.ArgumentParser()
  parser.add_argument('--output',
                      dest='output',
                      required=True,
                      help='Output file to write results to.')
  known_args, pipeline_args = parser.parse_known_args(argv)

  pipeline_options = PipelineOptions(pipeline_args)
  p = beam.Pipeline(options=pipeline_options)

  # create test data
  pc = p | 'Create test data' >> beam.Create(
      [('foo', 1), ('bar', 5), ('foo', 5), ('bar', 9), ('bar', 2)])

  # first run through data to apply custom CombineFn and determine min/max per key
  minmax = pc | 'Determine Min Max' >> beam.CombinePerKey(MinMaxFn())

  # group input data by key and append corresponding min and max
  merged = (pc, minmax) | 'Join Pcollections' >> beam.CoGroupByKey()

  # apply mapping to normalize values
  normalized = merged | 'Normalize values' >> beam.Map(normalize)

  # write results to output file
  normalized | 'Write results' >> WriteToText(known_args.output)

  result = p.run()
  result.wait_until_finish()

if __name__ == '__main__':
  logging.getLogger().setLevel(logging.INFO)
  run()

The snippet can be run with python SCRIPT_NAME.py --output OUTPUT_FILENAME. My test data, grouped by key, is:

('foo', [1, 5])
('bar', [5, 9, 2])

The CombineFn returns the min and max for each key:

('foo', (1, 5))
('bar', (2, 9))

The join / cogroup by key outputs (the order of elements within each list is not guaranteed):

('foo', ([1, 5], [(1, 5)]))
('bar', ([5, 9, 2], [(2, 9)]))

After normalization:

('foo', [0.0, 1.0])
('bar', [0.42857142857142855, 1.0, 0.0])

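This is just a simple test, so I'm sure it can be optimized for the data volume you mentioned, but it seems to work as a starting point. Keep in mind that further considerations may be needed (e.g. avoiding division by zero if min = max).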
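For the entry shape described in the question (each entry a list of (key, value) pairs), the same computation including a guard for min = max can be sketched in plain Python. The helper names key_ranges and normalize_entry, the sample entries, and the choice to map a zero-width range to 0.0 are all assumptions made for illustration.

```python
def key_ranges(entries):
    # Scan every (key, value) pair and track a (min, max) pair per key.
    ranges = {}
    for entry in entries:
        for key, value in entry:
            lo, hi = ranges.get(key, (value, value))
            ranges[key] = (min(lo, value), max(hi, value))
    return ranges


def normalize_entry(entry, ranges, default=0.0):
    # Apply (value - min) / (max - min); fall back to `default` when max == min.
    out = []
    for key, value in entry:
        lo, hi = ranges[key]
        span = hi - lo
        out.append((key, (value - lo) / span if span else default))
    return out


# Made-up sample: key "b" is constant, so its range has zero width.
entries = [
    [("a", 1), ("b", 10)],
    [("a", 5), ("b", 10)],
    [("a", 3), ("b", 10)],
]
ranges = key_ranges(entries)
normalized = [normalize_entry(entry, ranges) for entry in entries]
print(ranges)
print(normalized)
```

In a pipeline, one option might be to flatten the entries into (key, value) pairs with beam.FlatMap before combining, and to hand the per-key ranges to the normalization step as a side input (for example via beam.pvalue.AsDict) rather than joining with CoGroupByKey. That variant is untested here.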
