Remove a list of strings from a series of strings
Goal: remove the items in my list, strings_2_remove, from a series. I have a list of strings like so:
strings_2_remove = [
"dogs are so cool",
"cats have cute toe beans"
]
I also have a series of strings that looks like this:
df.Sentences.head()
0 dogs are so cool because they are nice and funny
1 many people love cats because cats have cute toe beans
2 hamsters are very small and furry creatures
3 i got a dog because i know dogs are so cool because they are nice and funny
4 birds are funny when they dance to music, they bop up and down
Name: Summary, dtype: object
The outcome after removing the strings in the list from the series should look like this:
0 because they are nice and funny
1 many people love cats because
2 hamsters are very small and furry creatures
3 i got a dog because i know because they are nice and funny
4 birds are funny when they dance to music, they bop up and down
Name: Summary, dtype: object
I have tried the following in an attempt to achieve the output I want:
mask_1 = (df.Sentences == strings_2_remove)
df.loc[mask_1, 'df.Sentences'] = " "
However, it does not achieve my goal. Any suggestions?
df.Sentences.apply(lambda x: re.sub('|'.join(strings_2_remove),'',x))
Use Series.replace:
df.Sentences.replace('|'.join(strings_2_remove), '', regex=True)
0 because they are nice and funny
1 many people love cats because
2 hamsters are very small and furry creatures
3 i got a dog because i know because they are n...
4 birds are funny when they dance to music, they...
Name: Sentences, dtype: object
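Joining the phrases with '|' treats each one as a regular expression, so a phrase containing characters such as '.', '?' or '(' would not be matched literally. A minimal sketch of a more defensive variant follows, assuming the phrases are meant as literal text; the sample series and the names pattern and cleaned are made up for illustration, and the whitespace cleanup at the end is an optional extra that the answer above does not perform.

```python
import re
import pandas as pd

strings_2_remove = ["dogs are so cool", "cats have cute toe beans"]

# Small illustrative series (not the asker's full data).
sentences = pd.Series([
    "dogs are so cool because they are nice and funny",
    "many people love cats because cats have cute toe beans",
    "hamsters are very small and furry creatures",
])

# re.escape makes every phrase match literally, even if it contains
# regex metacharacters.
pattern = "|".join(re.escape(s) for s in strings_2_remove)

cleaned = (
    sentences.str.replace(pattern, "", regex=True)  # drop the phrases
    .str.replace(r"\s+", " ", regex=True)           # collapse runs of whitespace
    .str.strip()                                    # trim leading/trailing spaces
)
print(cleaned.tolist())
```

If leading or doubled spaces are acceptable, the two whitespace steps can simply be dropped.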
I created the test DataFrame as:
df = pd.DataFrame({ 'Summary':[
'dogs are so cool because they are nice and funny',
'many people love cats because cats have cute toe beans',
'hamsters are very small and furry creatures',
'i got a dog because i know dogs are so cool because they are nice and funny',
'birds are funny when they dance to music, they bop up and down']})
The first step is to convert your strings_2_remove to a list of compiled patterns (you have to import re):
pats = [re.compile(s + ' *') for s in strings_2_remove]  # 's' avoids shadowing the built-in str
Note that each pattern is suffixed with ' *', which matches zero or more trailing spaces. Otherwise the result string could contain two adjacent spaces. As far as I can see, the other solutions missed this detail.
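To make the effect of the trailing ' *' concrete, here is a small sketch using plain re.sub on a single made-up sentence; the variable names text, plain and padded are illustrative only.

```python
import re

text = "i know dogs are so cool because they are nice"

# Without the optional trailing spaces, the space before and the space
# after the removed phrase both survive.
plain = re.sub("dogs are so cool", "", text)

# With ' *' the spaces following the phrase are consumed as well.
padded = re.sub("dogs are so cool *", "", text)

print(repr(plain))   # 'i know  because they are nice'
print(repr(padded))  # 'i know because they are nice'
```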
Then define a function to be applied:
def fn(txt):
    for pat in pats:
        if pat.search(txt):
            return pat.sub('', txt)
    return txt
For each pattern, the function searches the source string. On the first pattern that matches, it returns the string with every occurrence of that pattern replaced by an empty string. If no pattern matches, it returns the source string unchanged.
The only thing left to do is to apply this function:
df.Summary.apply(fn)
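One thing to be aware of: fn returns as soon as the first pattern matches, so a sentence containing both phrases would keep the second one. A possible variant that applies every pattern is sketched below. The name remove_all is hypothetical, and re.escape is an added assumption that the phrases should be matched literally.

```python
import re

strings_2_remove = ["dogs are so cool", "cats have cute toe beans"]

# One compiled pattern per phrase; ' *' also consumes trailing spaces.
pats = [re.compile(re.escape(s) + " *") for s in strings_2_remove]


def remove_all(txt):
    """Strip every listed phrase from txt, not only the first that matches."""
    for pat in pats:
        txt = pat.sub("", txt)
    return txt


print(remove_all("dogs are so cool and cats have cute toe beans too"))
```

With the test DataFrame above, this would be applied the same way, as df.Summary.apply(remove_all).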