Match a PySpark DataFrame column against a list and create a new column
I have the following list:
lst=['name','age','country']
I have the following PySpark DataFrame:
column_a  column_b
Aaaa      name,age,subject
Bbbb      name,age,country,subject
Cccc      name,subject,percentage
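I need to compare the list with column_b, check which of the list's values appear in that column, and then create a new column populated with the list values that are present in column_b.
The expected output is: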
column_a  column_b                  column_c
Aaaa      name,age,subject          name,age
Bbbb      name,age,country,subject  name,age,country
Cccc      name,subject,percentage   name
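array_intersect does what you are trying to achieve.
Note that array_intersect does not keep duplicates: if column_b is ["name", "name"], column_c will contain "name" only once, i.e. ["name"].
The example below builds column_b as an array column. If it is stored as a comma-separated string, as in the question, convert it first with F.split(F.col("column_b"), ",").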
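from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Reuse the active session if one exists (e.g. in the pyspark shell or a notebook).
spark = SparkSession.builder.getOrCreate()

data = [("Aaaa", ["name", "age", "subject"]),
        ("Bbbb", ["name", "age", "country", "subject"]),
        ("Cccc", ["name", "subject", "percentage"]),
        ("Dddd", ["name", "name"])]  # duplicate values, to show the de-duplication
df = spark.createDataFrame(data, ("column_a", "column_b"))

lst = ['name', 'age', 'country']
# Turn the Python list into literal columns so it can be wrapped in an array column.
lit_lst = [F.lit(v) for v in lst]
df.withColumn("column_c", F.array_intersect(F.col("column_b"), F.array(lit_lst))).show(truncate=False)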
+--------+-----------------------------+--------------------+
|column_a|column_b                     |column_c            |
+--------+-----------------------------+--------------------+
|Aaaa    |[name, age, subject]         |[name, age]         |
|Bbbb    |[name, age, country, subject]|[name, age, country]|
|Cccc    |[name, subject, percentage]  |[name]              |
|Dddd    |[name, name]                 |[name]              |
+--------+-----------------------------+--------------------+
To keep duplicates, you can apply the filter higher-order function instead.
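from pyspark.sql import functions as F

# `spark` is the active SparkSession, as in the previous example.
data = [("Aaaa", ["name", "age", "subject"]),
        ("Bbbb", ["name", "age", "country", "subject"]),
        ("Cccc", ["name", "subject", "percentage"]),
        ("Dddd", ["name", "name"])]
df = spark.createDataFrame(data, ("column_a", "column_b"))

lst = ['name', 'age', 'country']
lit_lst = [F.lit(v) for v in lst]
# First store the lookup list in column_c, then overwrite column_c with the
# elements of column_b that are contained in it (duplicates are preserved).
df.withColumn("column_c", F.array(lit_lst))\
    .withColumn("column_c", F.expr("filter(column_b, element -> array_contains(column_c, element))"))\
    .show(truncate=False)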
+--------+-----------------------------+--------------------+
|column_a|column_b                     |column_c            |
+--------+-----------------------------+--------------------+
|Aaaa    |[name, age, subject]         |[name, age]         |
|Bbbb    |[name, age, country, subject]|[name, age, country]|
|Cccc    |[name, subject, percentage]  |[name]              |
|Dddd    |[name, name]                 |[name, name]        |
+--------+-----------------------------+--------------------+
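To make the difference between the two approaches concrete, here is a plain-Python sketch of their semantics that runs without Spark. The helper names intersect_distinct and filter_keep_duplicates are hypothetical and are not part of PySpark. They only model the behaviour shown in the two outputs above, assuming the result follows the element order of column_b.

```python
lst = ["name", "age", "country"]


def intersect_distinct(values, allowed):
    """Model of the array_intersect result: matching values, each kept once."""
    seen = set()
    result = []
    for v in values:
        if v in allowed and v not in seen:
            seen.add(v)
            result.append(v)
    return result


def filter_keep_duplicates(values, allowed):
    """Model of filter(... array_contains ...): every matching element is kept."""
    return [v for v in values if v in allowed]


# Same sample rows as in the answer (column_a -> column_b).
rows = {
    "Aaaa": ["name", "age", "subject"],
    "Bbbb": ["name", "age", "country", "subject"],
    "Cccc": ["name", "subject", "percentage"],
    "Dddd": ["name", "name"],
}

for key, values in rows.items():
    print(key, intersect_distinct(values, lst), filter_keep_duplicates(values, lst))
```

Only the "Dddd" row differs between the two helpers, which matches the two tables above: use array_intersect when each matching value should appear once, and filter when repeated values in column_b must be preserved.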