
Remove a list of strings from a series of strings

Goal: Remove the items in my list, strings_2_remove, from a series. I have a list of strings like so:

strings_2_remove = [
"dogs are so cool",
"cats have cute toe beans"
]

I also have a series of strings that looks like this:

df.Sentences.head()

0    dogs are so cool because they are nice and funny 
1    many people love cats because cats have cute toe beans
2    hamsters are very small and furry creatures
3    i got a dog because i know dogs are so cool because they are nice and funny
4    birds are funny when they dance to music, they bop up and down
Name: Summary, dtype: object

After removing the strings in the list from the series, the result should look like this:

    0    because they are nice and funny 
    1    many people love cats because 
    2    hamsters are very small and furry creatures
    3    i got a dog because i know because they are nice and funny
    4    birds are funny when they dance to music, they bop up and down
    Name: Summary, dtype: object

I tried the following to get the output I want:

mask_1 = (df.Sentences == strings_2_remove)
df.loc[mask_1, 'df.Sentences'] = " "

However, it is not achieving my goal.

Any suggestions?

Try:

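result = df.Sentences
for string_to_remove in strings_2_remove:
    # Series.replace with regex=False only swaps whole cell values;
    # Series.str.replace removes the substring inside each sentence.
    result = result.str.replace(string_to_remove, '', regex=False)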

There are better, more performant solutions using RegEx. More information here.
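For reference, here is a minimal, self-contained sketch of the loop approach. It assumes the substrings are removed with Series.str.replace and literal matching (regex=False). The shortened sample series and the name sentences are illustrative, not taken from the question's DataFrame.

```python
import pandas as pd

strings_2_remove = ["dogs are so cool", "cats have cute toe beans"]

# Illustrative sample data (an assumption, shortened from the question)
sentences = pd.Series([
    "dogs are so cool because they are nice and funny",
    "many people love cats because cats have cute toe beans",
    "hamsters are very small and furry creatures",
])

result = sentences
for string_to_remove in strings_2_remove:
    # .str.replace operates on substrings; regex=False treats the text literally
    result = result.str.replace(string_to_remove, "", regex=False)

print(result.tolist())
```

Note that the surrounding spaces are left in place, so the first sentence keeps its leading space.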

import re

df.Sentences.apply(lambda x: re.sub('|'.join(strings_2_remove), '', x))

Use Series.replace:

df.Sentences.replace('|'.join(strings_2_remove), '', regex=True)

0                      because they are nice and funny
1                       many people love cats because 
2          hamsters are very small and furry creatures
3    i got a dog because i know  because they are n...
4    birds are funny when they dance to music, they...
Name: Sentences, dtype: object
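One caveat with joining the strings using '|': if any of them contained regex metacharacters such as parentheses or a period, the joined pattern would not match literally. The strings in the question do not, so this is only a precaution. A hedged sketch that escapes each string with re.escape first, where the second string to remove and the sample series are hypothetical and exist only to show the effect:

```python
import re
import pandas as pd

# The second entry is a made-up example containing regex metacharacters
strings_2_remove = ["dogs are so cool", "cats (really) rule."]

# Illustrative sample data (an assumption)
sentences = pd.Series([
    "dogs are so cool because they are nice and funny",
    "some say cats (really) rule. others disagree",
])

# Escape each string so it is matched literally, then join as alternatives
pattern = "|".join(re.escape(s) for s in strings_2_remove)
cleaned = sentences.replace(pattern, "", regex=True)

print(cleaned.tolist())
```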

I created the test DataFrame as:

df = pd.DataFrame({ 'Summary':[
    'dogs are so cool because they are nice and funny',
    'many people love cats because cats have cute toe beans',
    'hamsters are very small and furry creatures',
    'i got a dog because i know dogs are so cool because they are nice and funny',
    'birds are funny when they dance to music, they bop up and down']})

The first step is to convert your strings_2_remove to a list of patterns (you have to import re):

pats = [re.compile(s + ' *') for s in strings_2_remove]  # 's' avoids shadowing the built-in str

Note that each pattern is supplemented with ' *', which matches any spaces that follow the string. Otherwise the result could contain two adjacent spaces. As far as I can see, the other solutions missed this detail.
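To make the difference concrete, here is a small sketch using plain re.sub on one of the sample sentences. The variable names are illustrative.

```python
import re

text = "i got a dog because i know dogs are so cool because they are nice and funny"
target = "dogs are so cool"

# Without the optional space, the spaces on both sides of the match remain
plain = re.sub(target, "", text)

# With ' *' appended, the trailing space is consumed together with the match
with_space = re.sub(target + " *", "", text)

print(repr(plain))
print(repr(with_space))
```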

Then define a function to be applied:

def fn(txt):
    for pat in pats:
        if pat.search(txt):
            return pat.sub('', txt)
    return txt

For each pattern, it searches the source string; as soon as one pattern matches, it returns the source string with every occurrence of that pattern replaced by an empty string. If no pattern matches, it returns the source string unchanged.

The only thing left to do is apply this function:

df.Summary.apply(fn)
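Because fn returns as soon as one pattern matches, a sentence containing more than one of the strings would have only the first one removed. That does not happen in the sample data, but a variant that applies every pattern in turn could look like the sketch below. The name fn_all and the use of re.escape are additions for illustration, not part of the answer above.

```python
import re

strings_2_remove = ["dogs are so cool", "cats have cute toe beans"]

# Escape each string so it is matched literally, then allow trailing spaces
pats = [re.compile(re.escape(s) + " *") for s in strings_2_remove]

def fn_all(txt):
    """Remove every listed string (and any spaces after it) from txt."""
    for pat in pats:
        txt = pat.sub("", txt)
    return txt

print(fn_all("dogs are so cool and cats have cute toe beans too"))
```

It can be applied the same way, for example df.Summary.apply(fn_all).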
