How to remove specific strings from a list in a pyspark dataframe column

Question

I have the following Python list.

lst = ['name', 'age', 'country ']

The Spark dataframe is below.

column_a
name Xxxx, age 23, country aaaa
name yyyy, age 25, country bbbb

I have to compare the list with the Spark dataframe string column and remove the list values from the column.

Expected output is:

column_a
Xxxx, 23, aaaa
yyyy, 25, bbbb

Answer 1

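In case you don't want to import any additional modules, you can also use something like the line below. Note that this is pandas syntax, so it applies to a pandas dataframe (for example after df.toPandas()), not directly to a Spark dataframe. The tokens are joined back with a space, and the list entries are stripped so that 'country ' matches the token 'country'.

df['column_a'] = df['column_a'].apply(lambda x: ' '.join(i for i in x.split() if i not in [s.strip() for s in lst]))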
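The token-filtering idea in this answer can be sketched as a plain Python function, which makes it easy to check against the sample rows from the question. The helper name remove_tokens and the tokens_to_drop set are hypothetical names chosen for illustration. The Spark wiring is shown only in a comment, as one possible way to apply it, so that the block runs without a Spark session.

```python
lst = ['name', 'age', 'country ']

# Strip stray whitespace so the entry 'country ' matches the token 'country'.
tokens_to_drop = {s.strip() for s in lst}

def remove_tokens(text):
    """Drop every whitespace-separated token that appears in tokens_to_drop."""
    return ' '.join(tok for tok in text.split() if tok not in tokens_to_drop)

# Sample rows copied from the question.
rows = ['name Xxxx, age 23, country aaaa', 'name yyyy, age 25, country bbbb']
cleaned = [remove_tokens(r) for r in rows]
print(cleaned)

# With Spark available, the same function could be applied as a UDF (sketch only):
#   import pyspark.sql.functions as F
#   from pyspark.sql.types import StringType
#   df = df.withColumn('column_a', F.udf(remove_tokens, StringType())('column_a'))
```

A Python UDF is usually slower than a built-in column function, so this is mainly useful when the filtering logic is too awkward to express as a regular expression.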

Answer 2

You can use regexp_replace with '|'.join(). The first is commonly used to replace substring matches. The latter joins the elements of the list with |, which acts as "or" in a regular expression. The combination of the two removes any parts of your column that are present in your list.

import pyspark.sql.functions as F

df = df.withColumn('column_a', F.regexp_replace('column_a', '|'.join(lst), ''))
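With the list from the question, the plain joined pattern 'name|age|country ' leaves the space that followed name and age, and it would also match age inside a longer word such as page. A slightly stricter pattern can be sketched as follows. It is checked here with Python's re module so the block runs without Spark, on the assumption that these simple constructs (\b, (?:...), \s) behave the same in the Java regex engine that regexp_replace uses. The variable name pattern is illustrative.

```python
import re

lst = ['name', 'age', 'country ']

# Escape each entry, match it only as a whole word,
# and also consume any whitespace that follows it.
pattern = r'\b(?:' + '|'.join(re.escape(s.strip()) for s in lst) + r')\b\s*'

# Sample rows copied from the question.
rows = ['name Xxxx, age 23, country aaaa', 'name yyyy, age 25, country bbbb']
cleaned = [re.sub(pattern, '', r) for r in rows]
print(cleaned)

# The same pattern string could then be passed to Spark (sketch only):
#   import pyspark.sql.functions as F
#   df = df.withColumn('column_a', F.regexp_replace('column_a', pattern, ''))
```

Escaping matters once the list contains characters such as . or +, which would otherwise be read as regex operators.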

Solution 1
1 2021-12-29 13:04:30

Solution 2
0 2021-12-29 11:07:55
