How do I remove a string from a sentence if it matches a string in a list?

Question

I have a pandas.Series with sentences like this:

0    mi sobrino carlos bajó conmigo el lunes       
1    juan antonio es un tio guay                   
2    voy al cine con ramón                         
3    pepe el panadero siempre se porta bien conmigo
4    martha me hace feliz todos los días

On the other hand, I have a list of names and surnames like this:

l = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']

I want to match sentences from the series to the names in the list. The real data is much bigger than this example, so I thought that an element-wise comparison between the series and the list would not be efficient. Instead, I created one big string containing all the strings in the name list, like this:

'|'.join(l)

I tried to create a boolean mask that would later let me index the sentences that contain names from the list, like this:

series.apply(lambda x: x in '|'.join(l))

but it returns:

0    False
1    False
2    False
3    False
4    False

which is clearly not OK.

I also tried using str.contains(), but it doesn't behave as I expect: it checks whether any name from the list appears anywhere in a sentence, even as a substring of a longer word, and this is not what I need (i.e. I need an exact match).

Could you please point me in the right direction here?

Thank you very much in advance.

Answer 1

If you need an exact match, you can use word boundaries:

series.str.contains('|'.join(rf"\b{x}\b" for x in l))
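As a rough illustration of what the word boundaries change, here is a minimal sketch using only the standard `re` module on a plain list. The sentence containing "juanita" is a hypothetical row added for the comparison, and `re.escape` is an extra precaution in case the real names contain regex metacharacters; neither comes from the question.

```python
import re

sentences = [
    "mi sobrino carlos bajó conmigo el lunes",
    "juan antonio es un tio guay",
    "voy al cine con ramón",
    "juanita llega mañana",  # hypothetical: "juan" appears only inside a longer word
]
names = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']

# Plain alternation: matches a name anywhere, even inside a longer word
substring_pattern = re.compile('|'.join(re.escape(n) for n in names))

# Word boundaries: matches a name only when it stands as a whole word
exact_pattern = re.compile('|'.join(rf"\b{re.escape(n)}\b" for n in names))

substring_mask = [bool(substring_pattern.search(s)) for s in sentences]
exact_mask = [bool(exact_pattern.search(s)) for s in sentences]

print(substring_mask)  # [True, True, False, True]
print(exact_mask)      # [True, True, False, False]
```

The pattern string built for `exact_pattern` is the same kind of string that is passed to series.str.contains in the line above.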

To remove the values in the list, split each sentence and use a generator expression that keeps only the words that do not match:

series = series.apply(lambda x: ' '.join(y for y in x.split() if y not in l))
print(series)
                            
0                mi sobrino bajó conmigo el lunes
1                                  es un tio guay
2                           voy al cine con ramón
3  pepe el panadero siempre se porta bien conmigo
4             martha me hace feliz todos los días
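For larger name lists, the same split-and-filter idea can be sketched with a set, which makes each membership test constant-time instead of a scan through the list. The helper name `remove_names` and the sentence with a comma are hypothetical, added only to show the behaviour and one limitation of splitting on whitespace.

```python
def remove_names(sentence, names):
    """Drop every whitespace-separated word that appears in `names`."""
    return ' '.join(word for word in sentence.split() if word not in names)

# A set instead of a list: lookups stay fast when there are many names
names = {'juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos'}

print(remove_names("mi sobrino carlos bajó conmigo el lunes", names))
# mi sobrino bajó conmigo el lunes
print(remove_names("juan antonio es un tio guay", names))
# es un tio guay

# Limitation: "carlos," (with the comma attached) is not equal to "carlos",
# so it is left in place
print(remove_names("hola carlos, ven", names))
# hola carlos, ven
```

With pandas this could be applied as series.apply(lambda x: remove_names(x, names)).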

Answer 2

import re

data = ["mi sobrino carlos bajó conmigo el lunes", "juan antonio es un tio guay", "martha me hace feliz todos los días"]

regexs = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']

for sentence in data:
    # re.search scans the whole sentence (re.match only tests the start);
    # \b restricts each match to a whole word
    print(any(re.search(rf"\b{regex}\b", sentence) for regex in regexs))

I guess something simple like that could work.

cf: https://docs.python.org/fr/3/library/re.html

Answer 3

Regex to check whether a word appears at the start, at the end, or in between:

import re

import pandas as pd

df = pd.DataFrame({'texts': [
                             'mi sobrino carlos bajó conmigo el lunes',
                             'juan antonio es un tio guay',
                             'voy al cine con ramón',
                             'pepe el panadero siempre se porta bien conmigo',
                             'martha me hace feliz todos los días '
                             ]})

names = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']

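# \b matches at the start, at the end, and between words, so one form covers
# all three cases (a bare ^name would also match longer words such as "juanita")
pattern = "|".join(rf"\b{s}\b" for s in names)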

df[df.apply(lambda x: 
            x.astype(str).str.contains(pattern, flags=re.I)).any(axis=1)]

Answer 4

One option is set intersection:

l = {'juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos'}
series.apply(lambda x: len(set(x.split()).intersection(l)) > 0)
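A possible variant of the set approach, sketched on a plain list: `set.isdisjoint` answers the same yes/no question without building the intersection. The helper name `has_name` is hypothetical.

```python
names = {'juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos'}

def has_name(sentence, names=names):
    # isdisjoint accepts any iterable and returns True when nothing is shared,
    # so negating it tells us whether at least one word is a name
    return not names.isdisjoint(sentence.split())

sentences = [
    "mi sobrino carlos bajó conmigo el lunes",
    "juan antonio es un tio guay",
    "voy al cine con ramón",
    "pepe el panadero siempre se porta bien conmigo",
    "martha me hace feliz todos los días",
]

mask = [has_name(s) for s in sentences]
print(mask)  # [True, True, False, False, False]
```

Like the intersection version, this compares whole whitespace-separated words, so a name with punctuation attached would not be detected.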

Answer 5

For an exact match, wrap each name in word boundaries and try:

df.text.str.contains("|".join(rf"\b{x}\b" for x in l))

Otherwise, simply use a regular expression to replace the substring with ''. Build the list of patterns with a list comprehension:

df.replace(regex=[x for x in l], value='')
                          

                                   text
0               mi sobrino  bajó conmigo el lunes
1                                  es un tio guay
2                           voy al cine con ramón
3  pepe el panadero siempre se porta bien conmigo
4             martha me hace feliz todos los días
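Note that the replacement above works on substrings and leaves a double space where a name was removed (see row 0). A minimal sketch of one way around both issues, using the standard `re` module: the word boundaries, the trailing `\s*`, and the helper name `strip_names` are my own assumptions, not part of the answer.

```python
import re

names = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']

# \b limits the match to whole words; \s* also swallows the spaces that follow
# the name, so no double space is left behind
pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b\s*")

def strip_names(sentence):
    # strip() removes the trailing space left when a name ends the sentence
    return pattern.sub("", sentence).strip()

print(strip_names("mi sobrino carlos bajó conmigo el lunes"))
# mi sobrino bajó conmigo el lunes
print(strip_names("juan antonio es un tio guay"))
# es un tio guay
print(strip_names("juanita llega mañana"))
# juanita llega mañana
```

On a Series, the same function could be used through series.apply(strip_names).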

Answer 6

If you want a little more flexibility for processing, you can write a custom exact_match function as below:

import re 

def exact_match(text, l=l):
    return bool(re.search('|'.join(rf'\b{x}\b' for x in l), text))

series.apply(exact_match)

Output:

0     True
1     True
2    False
3    False
4    False
dtype: bool


Answer 1: 3, accepted, 2020-07-22 10:54:07
Answer 2: 1, 2020-07-22 10:55:50
Answer 3: 1, 2020-07-22 10:59:57
Answer 4: 1, 2020-07-22 11:02:29
Answer 5: 1, 2020-07-22 11:11:58
Answer 6: 1, 2020-07-22 11:39:52
