Apache Beam Min, Max and Average
In the answer at this link, Guillem Xercavins wrote a custom class to compute the minimum and maximum:
class MinMaxFn(beam.CombineFn):
    # initialize min and max values (I assumed int type)
    def create_accumulator(self):
        return (sys.maxint, 0)

    # update if current value is a new min or max
    def add_input(self, min_max, input):
        (current_min, current_max) = min_max
        return min(current_min, input), max(current_max, input)

    def merge_accumulators(self, accumulators):
        return accumulators

    def extract_output(self, min_max):
        return min_max
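One detail in the class above is worth a second look: merge_accumulators returns the collection of accumulators unchanged, whereas a CombineFn is expected to hand back a single accumulator. A minimal sketch of what that reduction could look like for (min, max) pairs follows. It is written as a standalone function with a hypothetical name, merge_min_max, so it runs without Beam, and it assumes at least one accumulator is passed in.

```python
def merge_min_max(accumulators):
    """Collapse several (min, max) pairs into a single (min, max) pair."""
    # Assumes a non-empty iterable of 2-tuples.
    mins, maxs = zip(*accumulators)
    return min(mins), max(maxs)


merged = merge_min_max([(3, 9), (1, 4), (5, 12)])
print(merged)  # (1, 12)
```

The same two lines could serve as the body of merge_accumulators in the class.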
I need to compute the average as well, and I found the sample code below:
class MeanCombineFn(beam.CombineFn):
    def create_accumulator(self):
        """Create a "local" accumulator to track sum and count."""
        return (0, 0)

    def add_input(self, (sum_, count), input):
        """Process the incoming value."""
        return sum_ + input, count + 1

    def merge_accumulators(self, accumulators):
        """Merge several accumulators into a single one."""
        sums, counts = zip(*accumulators)
        return sum(sums), sum(counts)

    def extract_output(self, (sum_, count)):
        """Compute the mean average."""
        if count == 0:
            return float('NaN')
        return sum_ / float(count)
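For reference, here is a rough sketch of how the four methods fit together. Tuple parameters in a signature, as in add_input(self, (sum_, count), input), are Python 2 only syntax (removed by PEP 3113), so the sketch unpacks the accumulator inside the method body instead. MeanCombiner is a plain-Python stand-in rather than a real beam.CombineFn, and run_combiner is a hypothetical driver that only approximates what a runner does: one accumulator per bundle of input, then a merge, then the final extraction.

```python
class MeanCombiner:
    """Plain-Python stand-in with the same four methods as a Beam CombineFn."""

    def create_accumulator(self):
        return (0, 0)  # (running sum, count)

    def add_input(self, accumulator, value):
        sum_, count = accumulator  # unpack in the body (valid in Python 3)
        return sum_ + value, count + 1

    def merge_accumulators(self, accumulators):
        sums, counts = zip(*accumulators)
        return sum(sums), sum(counts)

    def extract_output(self, accumulator):
        sum_, count = accumulator
        return sum_ / float(count) if count else float('NaN')


def run_combiner(combiner, bundles):
    """Hypothetical driver: one accumulator per bundle, then merge and extract."""
    partials = []
    for bundle in bundles:
        acc = combiner.create_accumulator()
        for value in bundle:
            acc = combiner.add_input(acc, value)
        partials.append(acc)
    return combiner.extract_output(combiner.merge_accumulators(partials))


mean = run_combiner(MeanCombiner(), [[1, 2], [3, 4]])
print(mean)  # 2.5
```

The two bundles produce the partial accumulators (3, 2) and (7, 2), which merge to (10, 4) before the mean is extracted.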
Any idea how to merge the average computation into MinMaxFn, so that a single class computes the minimum, maximum and average together and produces, for each key, an array of three values?
Here is a combined class that also computes the median:
import apache_beam as beam
import numpy as np

class MinMaxMeanFn(beam.CombineFn):
    def create_accumulator(self):
        # sum, min, max, count, values seen so far (kept for the median)
        # inf / -inf as starting points so any numeric input updates min and max
        return (0.0, float('inf'), float('-inf'), 0, [])

    def add_input(self, cur_data, input):
        (cur_sum, cur_min, cur_max, count, cur_values) = cur_data
        # After GroupByKey each input is a list of values; otherwise it is a single value.
        values = list(input) if isinstance(input, (list, tuple)) else [input]
        if not values:
            return cur_data
        return (cur_sum + sum(values),
                min(min(values), cur_min),
                max(max(values), cur_max),
                count + len(values),
                cur_values + values)

    def merge_accumulators(self, accumulators):
        sums, mins, maxs, counts, value_lists = zip(*accumulators)
        # Flatten the per-accumulator value lists into one list for the median.
        all_values = [v for values in value_lists for v in values]
        return sum(sums), min(mins), max(maxs), sum(counts), all_values

    def extract_output(self, cur_data):
        # Local names chosen to avoid shadowing the built-ins sum, min and max.
        (total, minimum, maximum, count, values) = cur_data
        avg = total / count if count else float('NaN')
        med = np.median(values) if values else float('NaN')
        return {
            "max": maximum,
            "min": minimum,
            "avg": avg,
            "count": count,
            "median": med
        }
Example usage:
(input
 | 'Format Price' >> beam.ParDo(FormatPriceDoFn())
 | 'Group Price by ID' >> beam.GroupByKey()
 | 'Compute price statistic for each ID' >> beam.CombinePerKey(MinMaxMeanFn()))
*I did not test whether CombinePerKey works without GroupByKey; feel free to try it.
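For what it is worth, CombinePerKey is documented as taking (key, value) pairs and doing the per-key grouping itself, so the GroupByKey step should not be required. What changes is the shape of each input the combiner receives: single values without GroupByKey, whole collections of values with it. The plain-Python sketch below only illustrates that difference in shape. It does not use Beam, and group_by_key is a hypothetical stand-in, not Beam's GroupByKey.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Mimic the shape GroupByKey produces: one (key, [values]) entry per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


pairs = [("a", 1.0), ("b", 5.0), ("a", 3.0)]

# Without GroupByKey, a per-key combiner sees each value for "a" on its own.
scalar_inputs = [value for key, value in pairs if key == "a"]

# With GroupByKey first, the single "value" for "a" is already a whole list.
grouped_inputs = [values for key, values in group_by_key(pairs) if key == "a"]

print(scalar_inputs)   # [1.0, 3.0]
print(grouped_inputs)  # [[1.0, 3.0]]
```

This is why add_input in the class above has a branch for list inputs. On a real runner the grouped values may arrive as some other iterable type rather than a list, so that check is an assumption worth verifying.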