Match pyspark dataframe column to list and create a new column
I have the below list:
lst=['name','age','country']
I have the below pyspark dataframe:
column_a column_b
Aaaa name,age,subject
Bbbb name,age,country,subject
Cccc name,subject,percentage
I have to compare the list with column_b, check which values in the list are part of that column, and create a new column populated with the values from the list that are present in column_b.
Below is the expected output:
column_a column_b column_c
Aaaa name,age,subject name,age
Bbbb name,age,country,subject name,age,country
Cccc name,subject,percentage name
array_intersect allows for the operation you want to achieve. Note that array_intersect does not preserve duplicates: if column_b had a value of ["name", "name"], then column_c would contain ["name"] only once.
from pyspark.sql import functions as F

data = [("Aaaa", ["name", "age", "subject"],),
        ("Bbbb", ["name", "age", "country", "subject"],),
        ("Cccc", ["name", "subject", "percentage"],),
        ("Dddd", ["name", "name"],),]
df = spark.createDataFrame(data, ("column_a", "column_b",))

lst = ['name', 'age', 'country']
# Build an array column of literals from the Python list
lit_lst = [F.lit(v) for v in lst]

# column_c = elements common to column_b and the literal array (deduplicated)
df.withColumn("column_c", F.array_intersect(F.col("column_b"), F.array(lit_lst))).show(truncate=False)
+--------+-----------------------------+--------------------+
|column_a|column_b |column_c |
+--------+-----------------------------+--------------------+
|Aaaa |[name, age, subject] |[name, age] |
|Bbbb |[name, age, country, subject]|[name, age, country]|
|Cccc |[name, subject, percentage] |[name] |
|Dddd |[name, name] |[name] |
+--------+-----------------------------+--------------------+
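The deduplicating behaviour of array_intersect can be sketched in plain Python (a rough illustration of the semantics, not Spark's implementation):

```python
def array_intersect_sketch(col_b, other):
    # Emit each element of col_b that also appears in other, in order of
    # first appearance in col_b; repeated elements are emitted only once.
    seen = set()
    result = []
    for v in col_b:
        if v in other and v not in seen:
            seen.add(v)
            result.append(v)
    return result

print(array_intersect_sketch(["name", "age", "subject"], ["name", "age", "country"]))
# ['name', 'age']
print(array_intersect_sketch(["name", "name"], ["name", "age", "country"]))
# ['name'] -- the duplicate is dropped, matching row Dddd above
```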
To preserve duplicates, the filter higher-order function can be applied.
from pyspark.sql import functions as F

data = [("Aaaa", ["name", "age", "subject"],),
        ("Bbbb", ["name", "age", "country", "subject"],),
        ("Cccc", ["name", "subject", "percentage"],),
        ("Dddd", ["name", "name"],),]
df = spark.createDataFrame(data, ("column_a", "column_b",))

lst = ['name', 'age', 'country']
lit_lst = [F.lit(v) for v in lst]

# First materialise the literal array as column_c, then keep every element of
# column_b (duplicates included) that occurs in that array
df.withColumn("column_c", F.array(lit_lst))\
  .withColumn("column_c", F.expr("filter(column_b, element -> array_contains(column_c, element))"))\
  .show(truncate=False)
+--------+-----------------------------+--------------------+
|column_a|column_b |column_c |
+--------+-----------------------------+--------------------+
|Aaaa |[name, age, subject] |[name, age] |
|Bbbb |[name, age, country, subject]|[name, age, country]|
|Cccc |[name, subject, percentage] |[name] |
|Dddd |[name, name] |[name, name] |
+--------+-----------------------------+--------------------+
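The filter expression keeps duplicates because it walks column_b element by element rather than computing a set intersection. In plain Python the same semantics look like this (an illustrative sketch, not Spark code):

```python
def filter_contains_sketch(col_b, allowed):
    # filter(column_b, element -> array_contains(column_c, element)):
    # every element of col_b that occurs in allowed survives, duplicates included.
    return [v for v in col_b if v in allowed]

print(filter_contains_sketch(["name", "name"], ["name", "age", "country"]))
# ['name', 'name'] -- both occurrences survive, matching row Dddd above
```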