
Mapreduce Python

I'm completely new to Python and MapReduce. It would be great if someone could help me achieve the result below. From a list like the following, I want to calculate the number of values per key and the average of the values per key. The first number in each pair is the key and the second is the value.

  • 1,5
  • 1,5
  • 2,7
  • 2,8
  • 1,10
  • 2,10
  • 3,3
  • 1,20

The output should look like this (key, count, average):

  • 1, 4, 10
  • 2, 3, 8.3
  • 3, 1, 3

Thank you.

I would recommend using itertools instead of reduce.
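For comparison, here is a minimal sketch of what a reduce-based version could look like. It is not part of the original answer; the names `accumulate`, `totals`, and `reduce_result` are hypothetical and chosen only for illustration. It folds each pair into a dictionary of running counts and sums, then derives the mean at the end.

```python
from functools import reduce

data = [[1, 5], [1, 5], [2, 7], [2, 8], [1, 10], [2, 10], [3, 3], [1, 20]]

def accumulate(acc, pair):
    """Reduce step (illustrative): fold one [key, value] pair into {key: (count, total)}."""
    key, value = pair
    count, total = acc.get(key, (0, 0))
    acc[key] = (count + 1, total + value)
    return acc

# Single pass over the data; the empty dict is the initial accumulator.
totals = reduce(accumulate, data, {})

# Turn each (count, total) into (key, count, mean), ordered by key.
reduce_result = [
    (key, count, total / count)
    for key, (count, total) in sorted(totals.items())
]
print(reduce_result)
```

This works, but the accumulator has to be threaded through by hand, which is one reason the grouping approach with itertools tends to read more simply. The itertools version follows.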

import itertools
import statistics

data = [[1,5], [1,5], [2,7], [2,8], [1,10], [2,10], [3,3], [1,20]]

# First, sort and group the input by key.
# groupby only groups consecutive items with equal keys, so the sort is required.
sorted_data = sorted(data, key=lambda x: x[0])
grouped = itertools.groupby(sorted_data, lambda e: e[0])

# Conceptually, this yields (key, group) pairs like the ones below
# (each group is really a lazy iterator, shown here as a list):
# [
#   (1, [[1, 5], [1, 5], [1, 10], [1, 20]]),
#   (2, [[2, 7], [2, 8], [2, 10]]),
#   (3, [[3, 3]])
# ]

# Drop the repeated key from each inner pair, keeping only the values
remove_duplicate_keys = map(lambda x: (x[0], [e[1] for e in x[1]]), grouped)

# This will produce the following structure:
# [
#   (1, [5, 5, 10, 20]),
#   (2, [7, 8, 10]),
#   (3, [3])
# ]

# Now, calculate count and mean for each entry
result = map(lambda x: (x[0], len(x[1]), statistics.mean(x[1])), remove_duplicate_keys)

# This will result in the following list:
# [(1, 4, 10), (2, 3, 8.333333333333334), (3, 1, 3)]

Note: groupby and map return lazy iterators (sorted is the exception and returns a list). This means Python will not calculate anything until you start consuming the result, and you can only iterate over the elements once. If you need them in a regular list, or need to access the results multiple times, replace the last line with this:

result = list(map(lambda x: (x[0], len(x[1]), statistics.mean(x[1])), remove_duplicate_keys))

This will convert the lazy iterator chain into a regular list.
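The single-use behaviour described in the note can be seen in a small sketch. It assumes the grouped values have already been computed; the names `grouped_values`, `lazy`, `first_pass`, and `second_pass` are illustrative only.

```python
import statistics

# Assumed input: the (key, values) pairs from the grouping step, as plain lists.
grouped_values = [(1, [5, 5, 10, 20]), (2, [7, 8, 10]), (3, [3])]

# map returns a lazy iterator; nothing is computed yet.
lazy = map(lambda x: (x[0], len(x[1]), statistics.mean(x[1])), grouped_values)

first_pass = list(lazy)   # consumes the iterator and computes all three entries
second_pass = list(lazy)  # the iterator is already exhausted, so this is empty

print(len(first_pass), len(second_pass))  # prints: 3 0
```

Wrapping the map call in list() once, as shown above, avoids this by storing the results so they can be read as often as needed.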
