
How to count number of occurrences by using pyspark

I'm trying to use pyspark to count the number of occurrences.

Suppose I have data like this:

data = sc.parallelize([(1,[u'a',u'b',u'd']),
                       (2,[u'a',u'c',u'd']),
                       (3,[u'a']) ])

count = sc.parallelize([(u'a',0),(u'b',0),(u'c',0),(u'd',0)])

Is it possible to count the number of occurrences in data and update them in count?

The result should look like [(u'a',3),(u'b',1),(u'c',1),(u'd',2)].

I would use Counter:

>>> from collections import Counter
>>>
>>> data.values().map(Counter).reduce(lambda x, y: x + y)
Counter({'a': 3, 'b': 1, 'c': 1, 'd': 2})
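
If you want the summed Counter back in the list-of-pairs form from the question, a minimal follow-up sketch (simply sorting the Counter's items) could look like:

>>> result = data.values().map(Counter).reduce(lambda x, y: x + y)
>>> sorted(result.items())
[('a', 3), ('b', 1), ('c', 1), ('d', 2)]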

RDDs are immutable and thus cannot be updated. Instead, you compute the count based on your data as:

count = (data
         .flatMap(lambda kv: kv[1])         # emit every word from each record's list
         .map(lambda w: (w, 1))
         .reduceByKey(lambda a, b: a + b))

Then, if the result fits in the driver's main memory, feel free to .collect() the count RDD.
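
For example, a minimal sketch (the ordering of pairs in the collected list may vary):

count.collect()
# e.g. [('a', 3), ('b', 1), ('c', 1), ('d', 2)]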

You wouldn't update count since RDDs are immutable. Just run the calculation you want and then save it directly to any variable you want:

In [17]: data.flatMap(lambda x: x[1]).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).collect()
Out[17]: [('b', 1), ('c', 1), ('d', 2), ('a', 3)]
