pyspark: counting the occurrences of distinct elements in a list

Question

I have the following data:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

data = {'date': ['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04', '2014-01-05', '2014-01-06'],
     'flat': ['A;A;B', 'D;P;E;P;P', 'H;X', 'P;Q;G', 'S;T;U', 'G;C;G']}

data = pd.DataFrame(data)
data['date'] = pd.to_datetime(data['date'])
spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "500g") \
    .appName('my-pandasToSparkDF-app') \
    .getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.sparkContext.setLogLevel("OFF")

df = spark.createDataFrame(data)
# ';' has no special meaning in a regex, so it needs no escaping
new_frame = df.withColumn("list", F.split("flat", ";"))

I would like to add a new column that holds the number of occurrences of each distinct element (sorted in ascending order) and another column that holds the maximum:

+-------------------+-----------+---------------------+-----------+----+
|               date| flat      | list                |occurrences|max |
+-------------------+-----------+---------------------+-----------+----+
|2014-01-01 00:00:00|A;A;B      |['A','A','B']        |[1,2]      |2   |
|2014-01-02 00:00:00|D;P;E;P;P  |['D','P','E','P','P']|[1,1,3]    |3   |
|2014-01-03 00:00:00|H;X        |['H','X']            |[1,1]      |1   |
|2014-01-04 00:00:00|P;Q;G      |['P','Q','G']        |[1,1,1]    |1   |
|2014-01-05 00:00:00|S;T;U      |['S','T','U']        |[1,1,1]    |1   |
|2014-01-06 00:00:00|G;C;G      |['G','C','G']        |[1,2]      |2   |  
+-------------------+-----------+---------------------+-----------+----+

Thank you very much!

Answer 1

For Spark 2.4+, this can be achieved without multiple groupBys and aggregations (which are expensive shuffle operations on big data). You can do it in a single expression using the higher-order functions transform and aggregate. This should be the canonical solution for Spark 2.4+.

from pyspark.sql import functions as F
df = spark.createDataFrame(data)
df.withColumn("list", F.split("flat", ";"))\
  .withColumn("occurances", F.expr("""array_sort(transform(array_distinct(list), x-> aggregate(list, 0,(acc,t)->acc+IF(t=x,1,0))))"""))\
  .withColumn("max", F.array_max("occurances"))\
  .show()
+-------------------+---------+---------------+----------+---+
|               date|     flat|           list|occurances|max|
+-------------------+---------+---------------+----------+---+
|2014-01-01 00:00:00|    A;A;B|      [A, A, B]|    [1, 2]|  2|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]| [1, 1, 3]|  3|
|2014-01-03 00:00:00|      H;X|         [H, X]|    [1, 1]|  1|
|2014-01-04 00:00:00|    P;Q;G|      [P, Q, G]| [1, 1, 1]|  1|
|2014-01-05 00:00:00|    S;T;U|      [S, T, U]| [1, 1, 1]|  1|
|2014-01-06 00:00:00|    G;C;G|      [G, C, G]|    [1, 2]|  2|
+-------------------+---------+---------------+----------+---+
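To make it easier to see what the SQL expression does per row, here is a minimal plain-Python sketch that mirrors each function in turn. It is only an illustration, not Spark code: the helper name occurrences and the rows list are made up for this example, and the mapping to array_distinct, transform, aggregate, array_sort and array_max is given in the comments.

```python
from functools import reduce


def occurrences(flat, sep=";"):
    """Hypothetical helper: mimic the Spark expression for one row."""
    items = flat.split(sep)                   # F.split("flat", ";")
    distinct = list(dict.fromkeys(items))     # array_distinct(list)
    counts = [
        # aggregate(list, 0, (acc, t) -> acc + IF(t = x, 1, 0))
        reduce(lambda acc, t: acc + (1 if t == x else 0), items, 0)
        for x in distinct                     # transform(..., x -> ...)
    ]
    counts.sort()                             # array_sort(...)
    return counts, max(counts)                # array_max(...)


rows = ["A;A;B", "D;P;E;P;P", "H;X", "P;Q;G", "S;T;U", "G;C;G"]
results = [occurrences(r) for r in rows]

for flat, (counts, biggest) in zip(rows, results):
    print(flat, counts, biggest)
```

Note that the list is scanned once for every distinct element, so the work per row grows with (list length) x (number of distinct elements). That is usually fine for short lists like these, and it avoids the shuffle that a groupBy would trigger.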

Answer 2

You can do this with a couple of groupBy statements.

To start with, you have a dataframe like this:

+-------------------+---------+---------------+
|               date|     flat|           list|
+-------------------+---------+---------------+
|2014-01-01 00:00:00|    A;A;B|      [A, A, B]|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]|
|2014-01-03 00:00:00|      H;X|         [H, X]|
|2014-01-04 00:00:00|    P;Q;G|      [P, Q, G]|
|2014-01-05 00:00:00|    S;T;U|      [S, T, U]|
|2014-01-06 00:00:00|    G;C;G|      [G, C, G]|
+-------------------+---------+---------------+

Explode the list column using F.explode, like this:

new_frame_exp = new_frame.withColumn("exp", F.explode('list'))

Then your dataframe will look like this:

+-------------------+---------+---------------+---+
|               date|     flat|           list|exp|
+-------------------+---------+---------------+---+
|2014-01-01 00:00:00|    A;A;B|      [A, A, B]|  A|
|2014-01-01 00:00:00|    A;A;B|      [A, A, B]|  A|
|2014-01-01 00:00:00|    A;A;B|      [A, A, B]|  B|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]|  D|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]|  P|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]|  E|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]|  P|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]|  P|
|2014-01-03 00:00:00|      H;X|         [H, X]|  H|
|2014-01-03 00:00:00|      H;X|         [H, X]|  X|
|2014-01-04 00:00:00|    P;Q;G|      [P, Q, G]|  P|
|2014-01-04 00:00:00|    P;Q;G|      [P, Q, G]|  Q|
|2014-01-04 00:00:00|    P;Q;G|      [P, Q, G]|  G|
|2014-01-05 00:00:00|    S;T;U|      [S, T, U]|  S|
|2014-01-05 00:00:00|    S;T;U|      [S, T, U]|  T|
|2014-01-05 00:00:00|    S;T;U|      [S, T, U]|  U|
|2014-01-06 00:00:00|    G;C;G|      [G, C, G]|  G|
|2014-01-06 00:00:00|    G;C;G|      [G, C, G]|  C|
|2014-01-06 00:00:00|    G;C;G|      [G, C, G]|  G|
+-------------------+---------+---------------+---+

On this dataframe, do a groupBy like this:

new_frame_exp_agg = new_frame_exp.groupBy('date', 'flat', 'list', 'exp').count()

Then you will have a dataframe like this:

+-------------------+---------+---------------+---+-----+
|               date|     flat|           list|exp|count|
+-------------------+---------+---------------+---+-----+
|2014-01-03 00:00:00|      H;X|         [H, X]|  H|    1|
|2014-01-04 00:00:00|    P;Q;G|      [P, Q, G]|  G|    1|
|2014-01-05 00:00:00|    S;T;U|      [S, T, U]|  U|    1|
|2014-01-05 00:00:00|    S;T;U|      [S, T, U]|  T|    1|
|2014-01-04 00:00:00|    P;Q;G|      [P, Q, G]|  P|    1|
|2014-01-03 00:00:00|      H;X|         [H, X]|  X|    1|
|2014-01-06 00:00:00|    G;C;G|      [G, C, G]|  G|    2|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]|  E|    1|
|2014-01-06 00:00:00|    G;C;G|      [G, C, G]|  C|    1|
|2014-01-05 00:00:00|    S;T;U|      [S, T, U]|  S|    1|
|2014-01-01 00:00:00|    A;A;B|      [A, A, B]|  B|    1|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]|  D|    1|
|2014-01-04 00:00:00|    P;Q;G|      [P, Q, G]|  Q|    1|
|2014-01-01 00:00:00|    A;A;B|      [A, A, B]|  A|    2|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]|  P|    3|
+-------------------+---------+---------------+---+-----+

On this dataframe, apply one more level of aggregation to collect the counts into a list and find the maximum, like this:

res = new_frame_exp_agg.groupBy('date', 'flat', 'list').agg(
                                         F.collect_list('count').alias('occurances'),
                                         F.max('count').alias('max'))

res.orderBy('date').show()


+-------------------+---------+---------------+----------+---+
|               date|     flat|           list|occurances|max|
+-------------------+---------+---------------+----------+---+
|2014-01-01 00:00:00|    A;A;B|      [A, A, B]|    [2, 1]|  2|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]| [1, 1, 3]|  3|
|2014-01-03 00:00:00|      H;X|         [H, X]|    [1, 1]|  1|
|2014-01-04 00:00:00|    P;Q;G|      [P, Q, G]| [1, 1, 1]|  1|
|2014-01-05 00:00:00|    S;T;U|      [S, T, U]| [1, 1, 1]|  1|
|2014-01-06 00:00:00|    G;C;G|      [G, C, G]|    [1, 2]|  2|
+-------------------+---------+---------------+----------+---+

Note that collect_list does not guarantee any order (see [2, 1] in the first row above). If you want the occurances column sorted, you can apply F.array_sort to it if you are on Spark 2.4+; on earlier versions, F.sort_array (available since Spark 1.5) sorts an array column in ascending order, so a UDF should not be needed.
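The explode / groupBy / collect steps above can be sketched in plain Python to show how the data moves and why an explicit sort is needed at the end. This is an illustration only, not Spark code: the variable names are made up, only two of the sample rows are used, and rows are keyed by date alone, whereas the answer groups by date, flat and list.

```python
from collections import Counter

rows = [("2014-01-01", "A;A;B"), ("2014-01-06", "G;C;G")]

# Step 1 (F.explode): one (date, element) pair per list entry
exploded = [(date, elem) for date, flat in rows for elem in flat.split(";")]

# Step 2 (groupBy(..., 'exp').count()): count each (date, element) pair
pair_counts = Counter(exploded)

# Step 3 (groupBy(...).agg(collect_list, max)): gather the counts per date.
# The order in which counts arrive is not something to rely on,
# so they are sorted explicitly below.
collected = {}
for (date, _elem), n in pair_counts.items():
    collected.setdefault(date, []).append(n)

summary = {date: (sorted(ns), max(ns)) for date, ns in collected.items()}
print(summary)
```

In Spark each of the two grouping steps can trigger a shuffle, which is the cost this approach pays compared with a per-row expression.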

Answer 1
1 2020-04-12 18:11:16

Answer 2
0 Accepted 2020-04-12 13:05:56
