Sum of a pyspark dataframe column containing dictionaries

I have a dataframe containing only one column, whose elements are of type MapType(StringType(), IntegerType()). I would like to obtain the cumulative sum of that column, where the sum operation means adding two dictionaries.

Minimal example

a = [{'Maps': ({'a': 1, 'b': 2, 'c': 3})}, {'Maps': ({'a': 2, 'b': 4, 'd': 6})}]
df = spark.createDataFrame(a)
df.show(5, False)

+---------------------------+
|Maps                       |
+---------------------------+
|Map(a -> 1, b -> 2, c -> 3)|
|Map(a -> 2, b -> 4, d -> 6)|
+---------------------------+

If I were to obtain the cumulative sum of the column Maps, I should get the following result.

+-----------------------------------+
|Maps                               |
+-----------------------------------+
|Map(a -> 3, b -> 6, c -> 3, d -> 6)|
+-----------------------------------+

PS: I am using Python 2.6, so collections.Counter is not available. I can probably install it if absolutely necessary.
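
For reference, this is roughly what a Counter-based version would look like on Python 2.7+, where dictionary addition comes for free (a sketch only, since I cannot run it):

from collections import Counter

# Sketch assuming Python 2.7+: Counter supports "+" for adding counts,
# so the whole merge collapses into a reduce over the Maps column.
df.rdd.map(lambda row: Counter(row.Maps)).reduce(lambda c1, c2: c1 + c2)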

My attempts:

I have tried an accumulator-based approach and an approach that uses fold.

Accumulator

from pyspark.accumulators import AccumulatorParam
from pyspark.sql.types import MapType, StringType, IntegerType

def addDictFun(x):
    global v
    v += x

class DictAccumulatorParam(AccumulatorParam):
    def zero(self, d):
        return d
    def addInPlace(self, d1, d2):
        for k in d1:
            d1[k] = d1[k] + (d2[k] if k in d2 else 0)
        for k in d2:
            if k not in d1:
                d1[k] = d2[k]
        return d1

v = sc.accumulator(MapType(StringType(), IntegerType()), DictAccumulatorParam())
cumsum_dict = df.rdd.foreach(addDictFun)

Now, at the end, I should have the resulting dictionary in v. Instead, I get the error MapType is not iterable (mostly on the line for k in d1 in the function addInPlace).

rdd.fold

The rdd.fold-based approach is as follows:

def add_dicts(d1, d2):
    for k in d1:
        d1[k] = d1[k] + (d2[k] if k in d2 else 0)
    for k in d2:
        if k not in d1:
            d1[k] = d2[k]
    return d1

cumsum_dict = df.rdd.fold(MapType(StringType(), IntegerType()), add_dicts)

However, I get the same MapType is not iterable error here. Any idea where I am going wrong?

pyspark.sql.types are schema descriptors, not collections or external language representations, so they cannot be used with fold or Accumulator.
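
For example, replacing the MapType(...) zero value with a plain Python dict (and unwrapping each Row to its Maps field first) is enough to make the original fold attempt run. A sketch, assuming the add_dicts helper from the question:

# Sketch: a plain dict works as the zero value where a DataType object does not.
cumsum_dict = df.rdd.map(lambda row: row.Maps).fold({}, add_dicts)
# {'a': 3, 'b': 6, 'c': 3, 'd': 6}  (key order may vary)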

The most straightforward solution is to explode and aggregate:

from pyspark.sql.functions import explode

df = spark.createDataFrame(
    [{'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 4, 'd': 6}], 
    "map<string,integer>"
).toDF("Maps")

df.select(explode("Maps")).groupBy("key").sum("value").rdd.collectAsMap()
# {'d': 6, 'c': 3, 'b': 6, 'a': 3}  

With RDDs you can do a similar thing:

from operator import add

df.rdd.flatMap(lambda row: row.Maps.items()).reduceByKey(add).collectAsMap()
# {'b': 6, 'c': 3, 'a': 3, 'd': 6}

or, if you really want fold:

from operator import attrgetter
from collections import defaultdict

def merge(acc, d):
    for k in d:
        acc[k] += d[k]
    return acc

df.rdd.map(attrgetter("Maps")).fold(defaultdict(int), merge)
# defaultdict(int, {'a': 3, 'b': 6, 'c': 3, 'd': 6})

@user8371915's answer using explode is more generic, but here's another approach that may be faster if you know the keys ahead of time:

import pyspark.sql.functions as f
myKeys = ['a', 'b', 'c', 'd']
df.select(*[f.sum(f.col('Maps').getItem(k)).alias(k) for k in myKeys]).show()
#+---+---+---+---+
#|  a|  b|  c|  d|
#+---+---+---+---+
#|  3|  6|  3|  6|
#+---+---+---+---+

And if you want the result as a MapType(), you can use pyspark.sql.functions.create_map like so:

from itertools import chain
df.select(
    f.create_map(
        list(
            chain.from_iterable(
                [[f.lit(k), f.sum(f.col('Maps').getItem(k))] for k in myKeys]
            )
        )
    ).alias("Maps")
).show(truncate=False)
#+-----------------------------------+
#|Maps                               |
#+-----------------------------------+
#|Map(a -> 3, b -> 6, c -> 3, d -> 6)|
#+-----------------------------------+
