How to subtract two string columns in a PySpark dataframe?
I want to compute column1 - column2, that is, remove from column1 every word that also appears in column2, and put the result in a new column named result.
PySpark dataframe (with the expected result column):
+---+-----------------------+------------------------+-----------+
|ID |column1                |column2                 |result     |
+---+-----------------------+------------------------+-----------+
|1  |Hi how are you fine but|Hi I am fine how about u|are you but|
|2  |javascript python XML  |python XML              |javascript |
|3  |include all the inform |include inform          |all the    |
+---+-----------------------+------------------------+-----------+
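You can split both columns on spaces and use array_except to remove from column1 every word that is present in column2, then join the remaining words back into a string: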
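from pyspark.sql import functions as F

# Split both columns into word arrays, keep the words of column1 that are
# not in column2, then join what is left back into a single string.
df1 = df.withColumn(
    "result",
    F.array_join(
        F.array_except(F.split("column1", " "), F.split("column2", " ")),
        " "
    )
)

df1.show(truncate=False)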
#+---+-----------------------+------------------------+-----------+
#|ID |column1                |column2                 |result     |
#+---+-----------------------+------------------------+-----------+
#|1  |Hi how are you fine but|Hi I am fine how about u|are you but|
#|2  |javascript python XML  |python XML              |javascript |
#|3  |include all the inform |include inform          |all the    |
#+---+-----------------------+------------------------+-----------+
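To make the per-row logic easier to follow, here is a minimal pure-Python sketch of what the split / array_except / array_join chain does to a single row. The helper name subtract_words is hypothetical and not part of PySpark. The sketch assumes that array_except compares whole, case-sensitive words, returns each remaining word only once (it drops duplicates), and keeps the words in their original order, as the output above suggests.

```python
def subtract_words(text, remove):
    """Hypothetical pure-Python model of
    array_join(array_except(split(text, " "), split(remove, " ")), " ")."""
    blocked = set(remove.split(" "))
    seen = set()
    kept = []
    for word in text.split(" "):
        # Skip words present in the second column, and skip repeats,
        # because array_except returns its result without duplicates.
        if word not in blocked and word not in seen:
            seen.add(word)
            kept.append(word)
    return " ".join(kept)


# The three rows from the question.
rows = [
    ("Hi how are you fine but", "Hi I am fine how about u"),
    ("javascript python XML", "python XML"),
    ("include all the inform", "include inform"),
]

results = [subtract_words(c1, c2) for c1, c2 in rows]
for value in results:
    print(value)
```

One consequence worth keeping in mind: because duplicates are dropped, a word that is repeated in column1 and absent from column2 would appear only once in result.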