How can I remove strings from sentences if they match strings in a list?
I have a pandas.Series with sentences like this:
0 mi sobrino carlos bajó conmigo el lunes
1 juan antonio es un tio guay
2 voy al cine con ramón
3 pepe el panadero siempre se porta bien conmigo
4 martha me hace feliz todos los días
On the other hand, I have a list of names and surnames like this:
l = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']
I want to match the sentences in the series against the names in the list. The real data is much bigger than this example, so I assumed that an element-wise comparison between the series and the list would not be efficient. Instead, I created one big string containing all the strings in the name list, like this:
'|'.join(l)
I then tried to create a boolean mask that would later let me select the sentences containing any name from the list, like this:
series.apply(lambda x: x in '|'.join(l))
but it returns:
0 False
1 False
2 False
3 False
4 False
which is clearly not OK.
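The mask is all False because `x in '|'.join(l)` asks whether the whole sentence is a substring of the joined names, not whether a name occurs in the sentence. A minimal plain-Python sketch of the difference (the variable names `joined` and `sentence` are only for illustration):

```python
names = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']
joined = '|'.join(names)  # 'juan|antonio|esther|...'
sentence = 'juan antonio es un tio guay'

# What the lambda tests: is the entire sentence inside the joined string?
sentence_in_joined = sentence in joined

# The intended direction: is any name inside the sentence?
# (this is still substring-based, so it is not yet an exact word match)
any_name_in_sentence = any(name in sentence for name in names)

print(sentence_in_joined, any_name_in_sentence)
```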
I also tried using str.contains(), but it doesn't behave as I expect: it does substring matching, so a name is also found when it is only part of a longer word, and this is not what I need (i.e. I need an exact match).
Could you please point me in the right direction here?
Thank you very much in advance.
If you need an exact match, you can use word boundaries:
series.str.contains('|'.join(rf"\b{x}\b" for x in l))
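A self-contained sketch of the word-boundary mask, assuming pandas is installed. The last row of the sample data is hypothetical and only there to show that longer words are not matched; `re.escape` is an extra precaution in case a name contains regex metacharacters.

```python
import re
import pandas as pd

series = pd.Series([
    'mi sobrino carlos bajó conmigo el lunes',
    'juan antonio es un tio guay',
    'voy al cine con ramón',
    'juanita vive en san carlosito',  # hypothetical row: names only as parts of words
])
l = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']

# \b on both sides makes each name match only as a whole word
pattern = '|'.join(rf"\b{re.escape(x)}\b" for x in l)
mask = series.str.contains(pattern)

# The boolean mask can then be used to index the series
matched = series[mask]
print(matched.tolist())
```

Inverting the mask (`series[~mask]`) would select the sentences without any name.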
To remove the values in the list, split each sentence and use a generator expression that keeps only the words that are not in the list:
series = series.apply(lambda x: ' '.join(y for y in x.split() if y not in l))
print (series)
0 mi sobrino bajó conmigo el lunes
1 es un tio guay
2 voy al cine con ramón
3 pepe el panadero siempre se porta bien conmigo
4 martha me hace feliz todos los días
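For a large name list, membership tests against a `list` are linear, so converting the names to a `set` first is probably worthwhile. A plain-Python sketch of the same filter; the helper name `remove_names` is made up for illustration:

```python
# A set gives constant-time (on average) membership tests
names = {'juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos'}

def remove_names(sentence, names=names):
    """Drop every whitespace-separated token that is exactly one of the names."""
    return ' '.join(word for word in sentence.split() if word not in names)

print(remove_names('juan antonio es un tio guay'))

# Caveat: tokens with attached punctuation are not equal to a name, so they stay
print(remove_names('hola, carlos.'))
```

With pandas this could be applied as `series.apply(remove_names)`.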
import re
data = ["mi sobrino carlos bajó conmigo el lunes", "juan antonio es un tio guay", "martha me hace feliz todos los días"]
regexs = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']
for regex in regexs:
    for sentence in data:
        # re.search scans the whole sentence; re.match would only test its start
        if re.search(regex, sentence):
            print(True)
        else:
            print(False)
I guess something simple like that could work.
cf: https://docs.python.org/fr/3/library/re.html
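Note that `re.match` only tests at the beginning of the string, while `re.search` scans all of it, and neither one enforces whole words unless the pattern says so. A short sketch of the difference (the second sample string is made up):

```python
import re

sentence = 'mi sobrino carlos bajó conmigo el lunes'

# re.match anchors at position 0, so a name in the middle is not found
at_start = re.match('carlos', sentence)

# re.search looks for the pattern anywhere in the string
anywhere = re.search('carlos', sentence)

# \b word boundaries reject a name that is only part of a longer word
whole_word = re.search(r'\bcarlos\b', 'carlosito llegó tarde')

print(at_start, bool(anywhere), whole_word)
```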
Use a regex to check whether a word is at the start, at the end, or in between:
import re
import pandas as pd
df = pd.DataFrame({'texts': [
    'mi sobrino carlos bajó conmigo el lunes',
    'juan antonio es un tio guay',
    'voy al cine con ramón',
    'pepe el panadero siempre se porta bien conmigo',
    'martha me hace feliz todos los días '
]})
names = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']
pattern = "|".join([f"^{s}|{s}$|\\b{s}\\b" for s in names])
# Keep the rows where any column contains one of the names, ignoring case
df[df.apply(lambda x:
            x.astype(str).str.contains(pattern, flags=re.I)).any(axis=1)]
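As a side note, `\b` also matches at the very start and end of a string, so the `^{s}` and `{s}$` alternatives are not needed for whole-word matching, and on their own they accept a name that is only the beginning of the first word or the end of the last word. A small sketch with plain `re`; the sample strings and the names `full` and `strict` are made up for illustration:

```python
import re

names = ['juan', 'carlos']

# Pattern with the extra ^ and $ alternatives
full = "|".join(f"^{s}|{s}$|\\b{s}\\b" for s in names)
# Pattern with word boundaries only
strict = "|".join(rf"\b{re.escape(s)}\b" for s in names)

samples = ['juan llegó', 'llegó carlos', 'juanita llegó', 'llegó juancarlos']

full_hits = [bool(re.search(full, t)) for t in samples]
strict_hits = [bool(re.search(strict, t)) for t in samples]

print(full_hits)
print(strict_hits)
```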
One option is set intersection:
l = set(['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos'])
s.apply(lambda x: len(set(x.split()).intersection(l))>0)
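A possible variant of the same idea uses `set.isdisjoint`, which answers the yes/no question directly without building the intersection set. The helper name `has_name` is hypothetical:

```python
names = {'juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos'}

def has_name(sentence, names=names):
    """True if at least one whitespace-separated word is exactly a name."""
    # isdisjoint accepts any iterable, so the split list can be passed as-is
    return not names.isdisjoint(sentence.split())

print(has_name('juan antonio es un tio guay'))
print(has_name('voy al cine con ramón'))
```

On a Series this would be used as `s.apply(has_name)`.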
To check whether a sentence contains any of the names (note that this matches substrings, not only whole words), try:
df.text.str.contains("|".join(l))
Otherwise, simply use a regular expression to replace each substring with ''. Build the list of substrings with a list comprehension:
df.replace(regex=[x for x in l], value='')
text
0 mi sobrino bajó conmigo el lunes
1 es un tio guay
2 voy al cine con ramón
3 pepe el panadero siempre se porta bien conmigo
4 martha me hace feliz todos los días
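Replacing plain substrings also removes a name from inside a longer word and can leave doubled spaces behind. A hedged sketch of one way around both, assuming pandas is installed; the third row is hypothetical, and the word boundaries and whitespace cleanup are additions that are not part of the call above:

```python
import re
import pandas as pd

df = pd.DataFrame({'text': [
    'mi sobrino carlos bajó conmigo el lunes',
    'juan antonio es un tio guay',
    'carlosito vino ayer',  # hypothetical row: name only as part of a word
]})
l = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']

# One alternation wrapped in word boundaries, so only whole words are removed
pattern = r'\b(?:' + '|'.join(map(re.escape, l)) + r')\b'

cleaned = (
    df['text']
    .str.replace(pattern, '', regex=True)   # drop the names
    .str.replace(r'\s+', ' ', regex=True)   # collapse the gaps left behind
    .str.strip()                            # trim leading/trailing spaces
)
print(cleaned.tolist())
```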
If you want a little more flexibility for processing, you can write a custom exact_match function as below:
import re
def exact_match(text, l=l):
    return bool(re.search('|'.join(rf'\b{x}\b' for x in l), text))
series.apply(exact_match)
Output:
0 True
1 True
2 False
3 False
4 False
dtype: bool
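Since the function above rebuilds the pattern string on every call, compiling it once may help on large data. A minimal sketch; the names `NAME_RE` and `exact_match_compiled` are hypothetical, and the `re.IGNORECASE` flag is an assumption that may not suit every dataset:

```python
import re

l = ['juan', 'antonio', 'esther', 'josefa', 'mariano', 'cristina', 'carlos']

# Build and compile the alternation once, outside the function
NAME_RE = re.compile(
    '|'.join(rf'\b{re.escape(x)}\b' for x in l),
    flags=re.IGNORECASE,  # assumption: capitalised names should match too
)

def exact_match_compiled(text, pattern=NAME_RE):
    """True if any name occurs in text as a whole word."""
    return bool(pattern.search(text))

print(exact_match_compiled('Juan Antonio es un tio guay'))
print(exact_match_compiled('voy al cine con ramón'))
```

It can be applied in the same way, for example `series.apply(exact_match_compiled)`.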