
How can I remove strings from sentences if they match strings in a list?

I have a pandas.Series with sentences like this:

0    mi sobrino carlos bajó conmigo el lunes       
1    juan antonio es un tio guay                   
2    voy al cine con ramón                         
3    pepe el panadero siempre se porta bien conmigo
4    martha me hace feliz todos los días 

On the other hand, I have a list of names and surnames like this:

l = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']

I want to match the sentences in the series against the names in the list. The real data is much bigger than this example, so I assumed an element-wise comparison between the series and the list would not be efficient. Instead, I created one big string containing all the strings in the name list, like this:

'|'.join(l)

I then tried to create a boolean mask that would let me index the sentences containing a name from the list, like this:

series.apply(lambda x: x in '|'.join(l))

but it returns:

0    False
1    False
2    False
3    False
4    False

which is clearly not right.

I also tried using str.contains(), but it doesn't behave as I expect: this method checks whether any name in the list occurs as a substring anywhere in a sentence, which is not what I need (i.e. I need an exact match).

Could you please point me in the right direction here?

Thank you very much in advance.

If you need an exact match, you can use word boundaries:

series.str.contains('|'.join(rf"\b{x}\b" for x in l))
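As a rough sketch of what the word boundaries change, the snippet below compares a word-bounded pattern with a plain alternation using only the standard `re` module. The `re.escape` call is an extra precaution that is not part of the answer above, and the sentence with `juanito` is a hypothetical example added to show the difference.

```python
import re

names = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']

# re.escape guards against names that contain regex metacharacters
# (an assumption-level precaution, not part of the original answer).
pattern = "|".join(rf"\b{re.escape(name)}\b" for name in names)
name_re = re.compile(pattern)

sentences = [
    "mi sobrino carlos bajó conmigo el lunes",
    "juan antonio es un tio guay",
    "voy al cine con ramón",
    "juanito llega mañana",  # hypothetical: 'juan' appears only as a prefix
]

# Word-bounded search versus a plain substring alternation.
exact = [bool(name_re.search(s)) for s in sentences]
loose = [bool(re.search("|".join(names), s)) for s in sentences]

print(exact)  # [True, True, False, False]
print(loose)  # [True, True, False, True]
```

With pandas, the same `pattern` string can be passed to `series.str.contains(pattern)`.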

To remove the listed values, split each sentence and use a generator expression that keeps only the words not in the list:

series = series.apply(lambda x: ' '.join(y for y in x.split() if y not in l))
print (series)
                            
0                mi sobrino bajó conmigo el lunes
1                                  es un tio guay
2                           voy al cine con ramón
3  pepe el panadero siempre se porta bien conmigo
4             martha me hace feliz todos los días

import re

data = ["mi sobrino carlos bajó conmigo el lunes", "juan antonio es un tio guay", "martha me hace feliz todos los días"]

regexs = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']

for regex in regexs:
    for sentence in data:
        # re.search scans the whole sentence (re.match only checks the start);
        # \b restricts the match to whole words
        if re.search(rf"\b{regex}\b", sentence):
            print(True)
        else:
            print(False)

I guess something simple like that could work.

cf: https://docs.python.org/fr/3/library/re.html
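One detail of the `re` module matters when a name can appear anywhere in a sentence: `re.match` only tries to match at the beginning of the string, while `re.search` scans the whole string. A minimal sketch of the difference, using one sentence from the question:

```python
import re

sentence = "mi sobrino carlos bajó conmigo el lunes"

# re.match anchors at position 0, so a name in the middle is not found.
match_middle = bool(re.match("carlos", sentence))

# re.search looks at every position in the string.
search_middle = bool(re.search("carlos", sentence))

# re.match does succeed when the word is at the very start.
match_start = bool(re.match("mi", sentence))

print(match_middle, search_middle, match_start)  # False True True
```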

Use a regex to check whether a name appears at the start, at the end, or in between:

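import re
import pandas as pd

df = pd.DataFrame({'texts': [
                             'mi sobrino carlos bajó conmigo el lunes',
                             'juan antonio es un tio guay',
                             'voy al cine con ramón',
                             'pepe el panadero siempre se porta bien conmigo',
                             'martha me hace feliz todos los días '
                             ]})

names = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']

# \b also matches at the start and end of the string, so one word-bounded
# alternative per name covers all three positions (a bare ^name or name$
# would also match longer words such as "juanito")
pattern = "|".join([f"\\b{s}\\b" for s in names])

# keep the rows where any column contains one of the names (case-insensitive)
df[df.apply(lambda x: 
            x.astype(str).str.contains(pattern, flags=re.I)).any(axis=1)]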

One option is set intersection:

l = set(['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos'])
s.apply(lambda x: len(set(x.split()).intersection(l))>0)
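The set-based check can be sketched as a small standalone function. The name `mentions_name` is hypothetical, and `isdisjoint` is used here as an equivalent of testing for an empty intersection. The last call illustrates a limitation of splitting on whitespace: a name followed directly by punctuation is not recognized.

```python
names = {'juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos'}

def mentions_name(sentence, names=names):
    """Return True if any whitespace-separated word of `sentence` is in `names`."""
    # isdisjoint is True when the two collections share no element.
    return not names.isdisjoint(sentence.split())

print(mentions_name("juan antonio es un tio guay"))  # True
print(mentions_name("voy al cine con ramón"))        # False
print(mentions_name("ayer vi a carlos."))            # False: 'carlos.' != 'carlos'
```

A function like this can be passed directly to `Series.apply` to build the boolean mask.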

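For an exact match, wrap each name in word boundaries (a plain "|".join(l) would also match substrings) and try:

df.text.str.contains("|".join(rf"\b{x}\b" for x in l))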

Otherwise, simply use regular expressions to replace each matching substring with ''. Build the list of patterns with a list comprehension:

df.replace(regex=[x for x in l], value='')
                          

                                   text
0               mi sobrino  bajó conmigo el lunes
1                                  es un tio guay
2                           voy al cine con ramón
3  pepe el panadero siempre se porta bien conmigo
4             martha me hace feliz todos los días
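Plain substring replacement leaves a double space where a name was removed, and it would also cut names out of longer words. A possible variant, sketched here with the standard `re` module, removes only whole words and then collapses the leftover whitespace. The function name `remove_names` and the `juanita` sentence are illustrative assumptions.

```python
import re

names = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']

# One compiled alternation; \b keeps the match to whole words only.
name_re = re.compile("|".join(rf"\b{re.escape(n)}\b" for n in names))

def remove_names(sentence):
    """Delete listed names, then normalize the remaining whitespace."""
    without = name_re.sub("", sentence)
    return " ".join(without.split())

print(remove_names("mi sobrino carlos bajó conmigo el lunes"))
# mi sobrino bajó conmigo el lunes
print(remove_names("juan antonio es un tio guay"))
# es un tio guay
print(remove_names("juan y juanita"))  # hypothetical sentence
# y juanita
```

On a pandas Series this could be applied with `series.apply(remove_names)`.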

If you want a little more flexibility for processing, you can write a custom exact_match function like the one below:

import re 

def exact_match(text, l=l):
    return bool(re.search('|'.join(rf'\b{x}\b' for x in l), text))

series.apply(exact_match)

Output:

0     True
1     True
2    False
3    False
4    False
dtype: bool

