Match PySpark DataFrame column to list and create a new column
I have the following list:
lst=['name','age','country']
I have the following PySpark DataFrame:
column_a  column_b
Aaaa      name,age,subject
Bbbb      name,age,country,subject
Cccc      name,subject,percentage
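I need to compare the list with column_b, check which values from the list appear in that column, and then create a new column populated with the list values that are present in column_b.
The expected output is shown below.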
column_a  column_b                  column_c
Aaaa      name,age,subject          name,age
Bbbb      name,age,country,subject  name,age,country
Cccc      name,subject,percentage   name
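array_intersect does what you are trying to achieve. The example below builds column_b as an array column.
array_intersect does not keep duplicates; that is, if the value of column_b is ["name", "name"], column_c will contain "name" only once: ["name"].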
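from pyspark.sql import functions as F

# Assumes an active SparkSession is available as `spark`.
data = [("Aaaa", ["name", "age", "subject"],),
        ("Bbbb", ["name", "age", "country", "subject"],),
        ("Cccc", ["name", "subject", "percentage"],),
        ("Dddd", ["name", "name"],),]
df = spark.createDataFrame(data, ("column_a", "column_b",))

lst = ['name', 'age', 'country']
# Wrap each list value in a literal so the list can be passed as an array column.
lit_lst = [F.lit(v) for v in lst]

# Keep only the elements of column_b that also appear in lst (duplicates are dropped).
df.withColumn("column_c", F.array_intersect(F.col("column_b"), F.array(lit_lst))).show(truncate=False)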
+--------+-----------------------------+--------------------+
|column_a|column_b                     |column_c            |
+--------+-----------------------------+--------------------+
|Aaaa    |[name, age, subject]         |[name, age]         |
|Bbbb    |[name, age, country, subject]|[name, age, country]|
|Cccc    |[name, subject, percentage]  |[name]              |
|Dddd    |[name, name]                 |[name]              |
+--------+-----------------------------+--------------------+
To keep duplicates, you can apply the filter higher-order function.
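from pyspark.sql import functions as F

# Assumes an active SparkSession is available as `spark`.
data = [("Aaaa", ["name", "age", "subject"],),
        ("Bbbb", ["name", "age", "country", "subject"],),
        ("Cccc", ["name", "subject", "percentage"],),
        ("Dddd", ["name", "name"],),]
df = spark.createDataFrame(data, ("column_a", "column_b",))

lst = ['name', 'age', 'country']
lit_lst = [F.lit(v) for v in lst]

# First store the list as an array in column_c, then overwrite column_c with the
# elements of column_b found in that array; filter keeps repeated elements.
df.withColumn("column_c", F.array(lit_lst))\
  .withColumn("column_c", F.expr("filter(column_b, element -> array_contains(column_c, element))"))\
  .show(truncate=False)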
+--------+-----------------------------+--------------------+
|column_a|column_b                     |column_c            |
+--------+-----------------------------+--------------------+
|Aaaa    |[name, age, subject]         |[name, age]         |
|Bbbb    |[name, age, country, subject]|[name, age, country]|
|Cccc    |[name, subject, percentage]  |[name]              |
|Dddd    |[name, name]                 |[name, name]        |
+--------+-----------------------------+--------------------+
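The difference between the two approaches can be illustrated with a plain-Python model that runs without Spark. This is only a sketch of the observable behaviour shown in the two output tables, not Spark's implementation; the function names intersect_distinct and filter_keep_duplicates are hypothetical, and null handling is ignored.

```python
def intersect_distinct(values, allowed):
    """Model of the array_intersect result: matching values, each kept once."""
    allowed_set = set(allowed)
    seen = set()
    out = []
    for v in values:
        if v in allowed_set and v not in seen:
            seen.add(v)
            out.append(v)
    return out


def filter_keep_duplicates(values, allowed):
    """Model of filter + array_contains: every matching value is kept."""
    allowed_set = set(allowed)
    return [v for v in values if v in allowed_set]


lst = ["name", "age", "country"]

print(intersect_distinct(["name", "name"], lst))      # ['name']
print(filter_keep_duplicates(["name", "name"], lst))  # ['name', 'name']
```

For rows without repeated values, such as Aaaa, Bbbb, and Cccc, both functions return the same list, which is why the two output tables differ only in the Dddd row.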