Intersect each row of a PySpark DataFrame, which is a list of strings, with a master list of strings?
Say I have a DataFrame like this:
[Row(case_number='5307793179', word_list=['n', 'b', 'c']),
Row(case_number='5307793171', word_list=['w', 'e', 'c']),
Row(case_number='5307793172', word_list=['1', 'f', 'c']),
Row(case_number='5307793173', word_list=['a', 'k', 'c']),
Row(case_number='5307793174', word_list=['z', 'l', 'c']),
Row(case_number='5307793175', word_list=['b', 'r', 'c'])]
And a master word list like this:
master_word_list = ['b', 'c']
Is there a sleek way to filter word_list against master_word_list so that the resulting PySpark DataFrame looks like this? (By sleek I mean without using UDFs; if UDFs are the best or only way, I'd accept that as a solution as well.)
[Row(case_number='5307793179', word_list=['b', 'c']),
Row(case_number='5307793171', word_list=['c']),
Row(case_number='5307793172', word_list=['c']),
Row(case_number='5307793173', word_list=['c']),
Row(case_number='5307793174', word_list=['c']),
Row(case_number='5307793175', word_list=['b', 'c'])]
array_intersect, available since Spark 2.4:
pyspark.sql.functions.array_intersect(col1, col2)
Collection function: returns an array of the elements in the intersection of col1 and col2, without duplicates.
Parameters:
- col1 – name of column containing array
- col2 – name of column containing array
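To make the "without duplicates" part concrete, here is a plain-Python model of what the function returns for two lists. It is only a sketch for intuition, not Spark's implementation: the name array_intersect_model is hypothetical, null handling is ignored, and keeping the order of the first list is an assumption rather than a documented guarantee.

```python
from typing import List


def array_intersect_model(col1: List[str], col2: List[str]) -> List[str]:
    """Return the items of col1 that also occur in col2, each at most once."""
    allowed = set(col2)   # membership test for the second list
    seen = set()          # items already emitted, to drop duplicates
    result = []
    for item in col1:
        if item in allowed and item not in seen:
            seen.add(item)
            result.append(item)
    return result


# The repeated 'b' appears only once in the result.
print(array_intersect_model(["b", "n", "b", "c"], ["b", "c"]))  # prints ['b', 'c']
```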
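from pyspark.sql.functions import array, array_intersect, lit

# Master list from the question, turned into a literal array column
master_word_list = ["b", "c"]
master_word_list_col = array(*[lit(x) for x in master_word_list])

# `spark` is assumed to be an existing SparkSession (as in the pyspark shell)
df = spark.createDataFrame(
    [("5307793179", ["n", "b", "c"])],
    ("case_number", "word_list")
)

# Keep only the words that also appear in the master list
df.withColumn("word_list", array_intersect("word_list", master_word_list_col)).show()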
+-----------+---------+
|case_number|word_list|
+-----------+---------+
| 5307793179|   [b, c]|
+-----------+---------+
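Since the question also allows UDFs, here is a rough sketch of that fallback for Spark versions older than 2.4, where array_intersect is not available. The helper names keep_master_words and add_filtered_column are made up for illustration. Unlike array_intersect, this version keeps repeated words, and a Python UDF will generally be slower than a built-in function.

```python
from typing import List, Optional


def keep_master_words(words: Optional[List[str]], master: List[str]) -> Optional[List[str]]:
    """Keep the words that appear in master, preserving order and duplicates."""
    if words is None:  # a null array stays null
        return None
    allowed = set(master)
    return [w for w in words if w in allowed]


def add_filtered_column(df, master: List[str]):
    """Overwrite word_list with the filtered list, using a Python UDF."""
    # Imported here so keep_master_words can be tried without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    filter_udf = udf(lambda words: keep_master_words(words, master),
                     ArrayType(StringType()))
    return df.withColumn("word_list", filter_udf("word_list"))
```

With the df and master_word_list from above, this would be called as add_filtered_column(df, master_word_list).show().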