简体   繁体   中英

Count occurrences of list of values in column using PySpark DataFrame

I have a PySpark DataFrame with a string column text and a separate list word_list and I need to count how many of the word_list values appear in each text row (can be counted more than once).

df = spark.createDataFrame(
  [(1,'Hello my name is John'), 
   (2,'Yo go Bengals'), 
   (3,'this is a text')
  ]
  , ['id','text']
)

word_list = ['is', 'm', 'o', 'my']

The result would be:

| text                  | list_count   |
| Hello my name is John |       6      |
| Yo go Bengals         |       2      |
| this is a text        |       2      | 

For text's first value, "is" occurs once, "m" occurs twice, "o" occurs twice, and "my" occurs once. In the second row, the only value from word_list that appears is "o" and it appears twice. In the third value for text , the only value from word_list that appears is "is" and it appears twice.

The result doesn't necessarily have to be PySpark-based either, it could be in Pandas if that's easier.

You can do this with a UDF as below

UDF

df = sql.createDataFrame(
  [(1,'Hello my name is John'), 
   (2,'Yo go Bengals'), 
   (3,'this is a text')
  ]
  , ['id','text']
)

word_list = ['is', 'm', 'o', 'my']

def count_values(inp,map_list=None):
    
    count = 0
    
    for pattern in map_list:
        if re.findall(pattern,inp):
            count += 1
    
    return count

count_values_udf = F.udf(partial(count_values,map_list=word_list),IntegerType())

df.select(
            count_values_udf(F.col('text')).alias('count_values')
         ).show()

+------------+
|count_values|
+------------+
|           4|
|           1|
|           1|
+------------+

To count number of occurrences of a substring in string column, you can split that column by the substring. The count corresponds to the size of resulting array minus 1.

So in your case, you can use aggregate function on the word_list array column and for each element, split the text column and get the size - 1 :

from pyspark.sql import functions as F

result = df.withColumn(
    "word_list",
    F.array(*[F.lit(x) for x in word_list])
).withColumn(
    "list_count",
    F.expr("aggregate(word_list, 0, (acc, x) -> acc + size(split(text, x)) -1)")
).drop("word_list")

result.show(truncate=False)
#+---+---------------------+----------+
#|id |text                 |list_count|
#+---+---------------------+----------+
#|1  |Hello my name is John|6         |
#|2  |Yo go Bengals        |2         |
#|3  |this is a text       |2         |
#+---+---------------------+----------+

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM