Sum of a PySpark dataframe column containing dictionaries
I have a dataframe containing only one column, whose elements are of type MapType(StringType(), IntegerType()). I would like to obtain the cumulative sum of that column, where the sum operation means adding two dictionaries.
Minimal example
a = [{'Maps': ({'a': 1, 'b': 2, 'c': 3})}, {'Maps': ({'a': 2, 'b': 4, 'd': 6})}]
df = spark.createDataFrame(a)
df.show(5, False)
+---------------------------+
|Maps |
+---------------------------+
|Map(a -> 1, b -> 2, c -> 3)|
|Map(a -> 2, b -> 4, d -> 6)|
+---------------------------+
If I were to obtain the cumulative sum of the column Maps, I should get the following result.
+-----------------------------------+
|Maps |
+-----------------------------------+
|Map(a -> 3, b -> 6, c -> 3, d -> 6)|
+-----------------------------------+
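To pin down what "adding two dictionaries" means here, the following is a minimal plain-Python sketch with no Spark involved. The helper name add_dicts_sketch is hypothetical and not part of the original post. It relies only on dict.get, so it does not need collections.Counter.

```python
def add_dicts_sketch(d1, d2):
    # Hypothetical helper: return a new dict holding the key-wise sum of d1 and d2.
    result = dict(d1)  # copy so that neither input is mutated
    for k in d2:
        # A key missing from the running result counts as 0.
        result[k] = result.get(k, 0) + d2[k]
    return result


first = {'a': 1, 'b': 2, 'c': 3}
second = {'a': 2, 'b': 4, 'd': 6}
total = add_dicts_sketch(first, second)
print(sorted(total.items()))
```

Keys present in only one of the dictionaries (c and d in the example) are carried over unchanged, which matches the expected result shown above.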
PS: I am using Python 2.6, so collections.Counter is not available. I can probably install it if absolutely necessary.
My attempts:
I have tried an accumulator based approach and an approach that uses fold.
Accumulator
def addDictFun(x):
    global v
    v += x

class DictAccumulatorParam(AccumulatorParam):
    def zero(self, d):
        return d
    def addInPlace(self, d1, d2):
        for k in d1:
            d1[k] = d1[k] + (d2[k] if k in d2 else 0)
        for k in d2:
            if k not in d1:
                d1[k] = d2[k]
        return d1

v = sc.accumulator(MapType(StringType(), IntegerType()), DictAccumulatorParam())
cumsum_dict = df.rdd.foreach(addDictFun)
Now at the end, I should have the resulting dictionary in v. Instead, I get the error MapType is not iterable (mostly on the line for k in d1 in the function addInPlace).
rdd.fold
The rdd.fold based approach is as follows:
def add_dicts(d1, d2):
    for k in d1:
        d1[k] = d1[k] + (d2[k] if k in d2 else 0)
    for k in d2:
        if k not in d1:
            d1[k] = d2[k]
    return d1

cumsum_dict = df.rdd.fold(MapType(StringType(), IntegerType()), add_dicts)
However, I get the same MapType is not iterable error here. Any idea where I am going wrong?
pyspark.sql.types are schema descriptors, not collections or external language representations, so they cannot be used as the initial value for fold or Accumulator.
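To illustrate the point, here is a rough local stand-in for how a fold behaves when its zero value is a real (empty) dict rather than a schema descriptor. This is a sketch without Spark: fold_local and merge_into are hypothetical names, and the two-partition layout is an assumption made only for the example.

```python
import copy
from functools import reduce


def merge_into(acc, d):
    # Add the values of d into acc key by key; acc is mutated and returned.
    for k in d:
        acc[k] = acc.get(k, 0) + d[k]
    return acc


def fold_local(partitions, zero, op):
    # Local stand-in for a fold: reduce each partition starting from its own
    # copy of the zero value, then reduce the per-partition results the same way.
    partials = [reduce(op, part, copy.deepcopy(zero)) for part in partitions]
    return reduce(op, partials, copy.deepcopy(zero))


# Assumed layout: one dictionary per partition.
partitions = [[{'a': 1, 'b': 2, 'c': 3}], [{'a': 2, 'b': 4, 'd': 6}]]
result = fold_local(partitions, {}, merge_into)
print(sorted(result.items()))
```

The zero value has to be something the merge function can iterate over and update, such as {}. An object that only describes the column type gives the merge function nothing to work with, which is consistent with the "not iterable" error in the question.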
The most straightforward solution is to explode and aggregate:
from pyspark.sql.functions import explode
df = spark.createDataFrame(
[{'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 4, 'd': 6}],
"map<string,integer>"
).toDF("Maps")
df.select(explode("Maps")).groupBy("key").sum("value").rdd.collectAsMap()
# {'d': 6, 'c': 3, 'b': 6, 'a': 3}
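For readers who want to see what the explode-then-aggregate pipeline computes, the following is a Spark-free sketch of the same two steps on plain Python dicts. The variable names are made up for illustration. The first step flattens each map into (key, value) pairs, and the second sums the values per key.

```python
rows = [{'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 4, 'd': 6}]

# Step 1: flatten every map into (key, value) pairs, one pair per entry.
pairs = [(k, v) for d in rows for k, v in d.items()]

# Step 2: group the pairs by key and sum the values within each group.
totals = {}
for k, v in pairs:
    totals[k] = totals.get(k, 0) + v

print(sorted(totals.items()))
```

Because the grouping works on whatever keys show up in the data, this style does not need the set of keys to be known in advance.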
With RDD you can do a similar thing:
from operator import add
df.rdd.flatMap(lambda row: row.Maps.items()).reduceByKey(add).collectAsMap()
# {'b': 6, 'c': 3, 'a': 3, 'd': 6}
Or, if you really want fold:
from operator import attrgetter
from collections import defaultdict

def merge(acc, d):
    for k in d:
        acc[k] += d[k]
    return acc

df.rdd.map(attrgetter("Maps")).fold(defaultdict(int), merge)
# defaultdict(int, {'a': 3, 'b': 6, 'c': 3, 'd': 6})
@user8371915's answer using explode is more generic, but here's another approach that may be faster if you know the keys ahead of time:
import pyspark.sql.functions as f
myKeys = ['a', 'b', 'c', 'd']
df.select(*[f.sum(f.col('Maps').getItem(k)).alias(k) for k in myKeys]).show()
#+---+---+---+---+
#| a| b| c| d|
#+---+---+---+---+
#| 3| 6| 3| 6|
#+---+---+---+---+
And if you wanted the result in a MapType(), you could use pyspark.sql.functions.create_map like:
from itertools import chain
df.select(
    f.create_map(
        list(
            chain.from_iterable(
                [[f.lit(k), f.sum(f.col('Maps').getItem(k))] for k in myKeys]
            )
        )
    ).alias("Maps")
).show(truncate=False)
#+-----------------------------------+
#|Maps |
#+-----------------------------------+
#|Map(a -> 3, b -> 6, c -> 3, d -> 6)|
#+-----------------------------------+