
Counting distinct substring occurrences in column for every row in Pyspark?

My data set looks like this, where I have a comma-delimited set of string values in col1 and col2, and col3 is the two columns concatenated together.

+===========+========+===========
|col1       |col2    |col3   
+===========+========+===========
|a,b,c,d    |a,c,d   |a,b,c,d,a,c,d 
|e,f,g      |f,g,h   |e,f,g,f,g,h
+===========+========+===========

Basically, what I'm trying to do is grab all the values separated by commas in col3 and append another column with each value and its count.

In other words, I'm trying to get this kind of output in col4:

+===========+========+==============+======================
|col1       |col2    |col3          |col4
+===========+========+==============+======================
|a,b,c,d    |a,c,d   |a,b,c,d,a,c,d |a: 2, b: 1, c: 2, d: 2
|e,f,g      |f,g,h   |e,f,g,f,g,h   |e: 1, f: 2, g: 2, h: 1
+===========+========+==============+======================

I've figured out how to concatenate the columns together to get col3, but I'm having a bit of trouble getting to col4. Here's where I've left off, and I'm a bit unsure of where to go from here:

from pyspark.sql.functions import concat_ws, countDistinct


df = df.select(concat_ws(',', df.col1, df.col2).alias('col3'), '*')
df.agg(countDistinct('col3')).show()

+--------------------+
|count(DISTINCT col3)|
+--------------------+
|                   2|
+--------------------+

Question: How do I dynamically count the substrings separated by commas in col3 and produce a final column that shows the frequency of each substring for all rows in the dataset?

Use UDF

Here is a way to do this using UDFs. First, generate the data.

from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql.functions import concat_ws, udf

data = [("a,b,c,d", "a,c,d", "a,b,c,d,a,c,d"),
        ("e,f,g", "f,g,h", "e,f,g,f,g,h")
       ]
schema = StructType([
    StructField("col1",StringType(),True),
    StructField("col2",StringType(),True),
    StructField("col3",StringType(),True),
])
df = spark.createDataFrame(data=data,schema=schema)

Then use some native Python tools, like Counter and json, to accomplish the task.

from collections import Counter
import json

@udf(StringType())
def count_occurances(row):
    # Split the comma-delimited string and return its value counts as a JSON string.
    return json.dumps(dict(Counter(row.split(','))))

df.withColumn('concat', concat_ws(',', df.col1, df.col2, df.col3))\
  .withColumn('counts', count_occurances('concat')).show(2, False)

Results in

+-------+-----+-------------+---------------------------+--------------------------------+
|col1   |col2 |col3         |concat                     |counts                          |
+-------+-----+-------------+---------------------------+--------------------------------+
|a,b,c,d|a,c,d|a,b,c,d,a,c,d|a,b,c,d,a,c,d,a,b,c,d,a,c,d|{"a": 4, "b": 2, "c": 4, "d": 4}|
|e,f,g  |f,g,h|e,f,g,f,g,h  |e,f,g,f,g,h,e,f,g,f,g,h    |{"e": 2, "f": 4, "g": 4, "h": 2}|
+-------+-----+-------------+---------------------------+--------------------------------+
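
Note that because col3 already contains col1 and col2, concatenating all three columns doubles every count relative to the expected output in the question. A minimal variation, if the counts should come from col3 alone, would be to apply the same UDF directly to that column:

# Applying the UDF to col3 only reproduces the counts expected in the question,
# e.g. {"a": 2, "b": 1, "c": 2, "d": 2} for the first row.
df.withColumn('counts', count_occurances('col3')).show(2, False)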

Solution using native PySpark functions

This solution is a bit more involved than the UDF approach, but it can be more performant because it avoids UDFs. The idea is to concatenate the three string columns and explode the result. To know which row each exploded value came from, we add an index. Grouping twice then gives us the desired result. Finally, we join the result back to the original frame to get the desired schema.

from pyspark.sql.functions import concat_ws, monotonically_increasing_id, split, explode, collect_list

# Add a row index so we know which row each exploded value came from.
df = df.withColumn('index', monotonically_increasing_id())

# Concatenate, split, and explode, then count per (row, value) and
# collect the "value:count" pairs back into one string per row.
counts = (
    df.withColumn('concat', concat_ws(',', df.col1, df.col2, df.col3))
      .withColumn('arr_col', split('concat', ','))
      .withColumn('explode_col', explode('arr_col'))
      .groupBy('index', 'explode_col').count()
      .withColumn('concat_counts', concat_ws(':', 'explode_col', 'count'))
      .groupBy('index')
      .agg(concat_ws(',', collect_list('concat_counts')).alias('grouped_counts'))
)

# Join the per-row counts back to the original frame.
df.join(counts, on='index').show()

results in

+-----------+-------+-----+-------------+---------------+
|      index|   col1| col2|         col3| grouped_counts|
+-----------+-------+-----+-------------+---------------+
|42949672960|a,b,c,d|a,c,d|a,b,c,d,a,c,d|a:4,b:2,c:4,d:4|
|94489280512|  e,f,g|f,g,h|  e,f,g,f,g,h|h:2,g:4,f:4,e:2|
+-----------+-------+-----+-------------+---------------+
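
Since the helper index column is not part of the desired schema, a small follow-up could drop it after the join:

# Same join as above, with the helper index removed from the output.
df.join(counts, on='index').drop('index').show()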

Please note that the JSON we created in the UDF part is usually much handier to work with than the plain string in the grouped_counts column produced with the native PySpark functions.
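
For example, here is a quick sketch of how that JSON column could be parsed back into a proper map with from_json (counted is an assumed name for the frame carrying the counts column from the UDF example above):

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import MapType, StringType, IntegerType

# `counted` is assumed to be the DataFrame with the JSON `counts` column
# produced in the UDF example above.
parsed = counted.withColumn(
    'counts_map',
    from_json(col('counts'), MapType(StringType(), IntegerType()))
)

# Individual frequencies can then be looked up by key.
parsed.select(col('counts_map').getItem('a').alias('count_a')).show()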
