How to subtract two string columns in a PySpark DataFrame?

Question

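I want to compute column1 - column2, that is, remove from column1 every substring that also appears in column2, and put the result in a new column named result.

The PySpark DataFrame, with the expected result column: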

+--+-------------------------+--------------------------+--------------+
|ID|           column1       |   column2                | result       |
+--+-------------------------+--------------------------+--------------+
|1 | Hi how are you fine but | Hi I am fine how about u | are you but  |
|2 | javascript python XML   | python XML               | javascript   |
|3 | include all the inform  | include inform           | all the      |
+--+-------------------------+--------------------------+--------------+

Answer 1

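You can split both columns into arrays of words and use array_except to remove from column1 every word that also appears in column2. Note that array_except compares whole space-separated words rather than arbitrary substrings, and it returns the remaining words without duplicates: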

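from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Sample rows from the question, without the expected result column
df = spark.createDataFrame(
    [(1, "Hi how are you fine but", "Hi I am fine how about u"),
     (2, "javascript python XML", "python XML"),
     (3, "include all the inform", "include inform")],
    ["ID", "column1", "column2"],
)

# Split both columns into words, keep the words of column1 that are not in column2, then join them back
df1 = df.withColumn(
    "result",
    F.array_join(
        F.array_except(F.split("column1", " "), F.split("column2", " ")),
        " "
    )
)

df1.show(truncate=False)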

#+---+-----------------------+------------------------+-----------+
#|ID |column1                |column2                 |result     |
#+---+-----------------------+------------------------+-----------+
#|1  |Hi how are you fine but|Hi I am fine how about u|are you but|
#|2  |javascript python XML  |python XML              |javascript |
#|3  |include all the inform |include inform          |all the    |
#+---+-----------------------+------------------------+-----------+
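
To make the per-row behaviour of the expression above easier to reason about, here is a minimal plain-Python sketch of the same logic. The helper name subtract_words is hypothetical and is not part of PySpark. It assumes single-space separators and mirrors two properties of array_except: the order of the surviving words follows column1, and repeated words are collapsed to one occurrence.

```python
def subtract_words(text1, text2, sep=" "):
    """Plain-Python model of split + array_except + array_join for one row.

    Hypothetical helper for illustration only; it is not a PySpark API.
    """
    exclude = set(text2.split(sep))   # words present in column2
    seen = set()                      # words already emitted (array_except drops duplicates)
    kept = []
    for word in text1.split(sep):
        if word not in exclude and word not in seen:
            seen.add(word)
            kept.append(word)
    return sep.join(kept)


# Rows from the question
print(subtract_words("Hi how are you fine but", "Hi I am fine how about u"))  # are you but
print(subtract_words("javascript python XML", "python XML"))                  # javascript
print(subtract_words("include all the inform", "include inform"))             # all the

# Duplicates in the first string are collapsed, as with array_except
print(subtract_words("a b a c", "c"))                                         # a b
```

If repeated words in column1 must be preserved, array_except is probably not the right tool, and a filter over the split array would be a closer fit.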

Answer 1: accepted, 2021-02-04 17:59:21
