Remove a list of phrases from a string
I have a list of phrases (n-grams) that need to be removed from a given sentence.
removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
I want to get:
new_sentence = 'Oranges are the main ingredient for a wide of'
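I tried the usual approach for removing a list of phrases from a string, but it doesn't work ('Oranges' becomes 'Os', and 'drinks' is removed instead of the whole phrase 'food and drinks').
Does anyone know how to solve this? Thanks!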
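For reference, the failure described above can be reproduced with plain substring replacement. This is only a guess at what the attempt looked like; `naive_remove` is a hypothetical helper name, not code from the question.

```python
removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'

def naive_remove(text, phrases):
    """Remove each phrase by plain substring replacement (hypothetical helper)."""
    for phrase in phrases:
        # str.replace ignores word boundaries, so 'range' also matches inside 'Oranges'
        text = text.replace(phrase, '')
    # collapse the leftover runs of whitespace
    return ' '.join(text.split())

naive_result = naive_remove(sentence, removed)
print(naive_result)
```

Because 'drinks' is processed before 'food and drinks', the longer phrase is no longer present by the time it is tried, so 'food and' is left behind.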
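Since you only want to match whole words, I think the first step is to turn everything into lists of words, and then iterate from the longest phrase to the shortest to find things to remove: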
>>> removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
>>> sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
>>> words = sentence.split()
>>> for ngram in sorted([r.split() for r in removed], key=len, reverse=True):
...     for i in range(len(words) - len(ngram)+1):
...         if words[i:i+len(ngram)] == ngram:
...             words = words[:i] + words[i+len(ngram):]
...             break
...
>>> " ".join(words)
'Oranges are the main ingredient for a wide of'
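Note that this simple approach has a flaw: it will not remove multiple copies of the same n-gram. You also can't simply keep the inner loop running after modifying words (the length will have changed), so if you need to handle duplicates you will have to batch the updates.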
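One way to handle duplicates is to build a new list in a single left-to-right pass instead of modifying `words` in place. The sketch below is an assumption about how that could look, not the answer's own code; `remove_ngrams` is a hypothetical name. At each position it tries the longest phrase first, which can differ from the loop above in edge cases where removing one phrase brings the words of another together.

```python
def remove_ngrams(sentence, phrases):
    """Drop every whole-word occurrence of each phrase (hypothetical helper)."""
    words = sentence.split()
    # Split phrases into word lists, skip empty ones, longest first
    ngrams = sorted((p.split() for p in phrases if p.split()),
                    key=len, reverse=True)
    kept = []
    i = 0
    while i < len(words):
        for ngram in ngrams:
            n = len(ngram)
            if words[i:i + n] == ngram:
                i += n  # skip the matched phrase entirely
                break
        else:
            # no phrase starts at this position, so keep the word
            kept.append(words[i])
            i += 1
    return ' '.join(kept)

removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
print(remove_ngrams(sentence, removed))
```

Since nothing is deleted from `words` while scanning, the index never goes stale and repeated phrases are all removed.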
Regex time!
In [116]: removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
...: removed = sorted(removed, key=len, reverse=True)
...: sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
...: new_sentence = sentence
...: import re
...: removals = [r'\b' + re.escape(phrase) + r'\b' for phrase in removed]
...: for removal in removals:
...:     new_sentence = re.sub(removal, '', new_sentence)
...: new_sentence = ' '.join(new_sentence.split())
...: print(sentence)
...: print(new_sentence)
Oranges are the main ingredient for a wide range of food and drinks
Oranges are the main ingredient for a wide of
import re
removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
# sort the phrases to remove by length, longest first
removed = sorted(removed, key=len, reverse=True)
# replace each phrase with a space, matching only on word boundaries
for r in removed:
    sentence = re.sub(r"\b{}\b".format(re.escape(r)), " ", sentence)
# collapse runs of spaces into a single one and trim the ends
sentence = re.sub(' +', ' ', sentence).strip()
I hope this helps. First, you need to sort the strings to remove by length, so that 'food and drinks' is replaced before 'drinks'.
Here you go:
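The regex approaches above could also be folded into a single compiled pattern, so the sentence is scanned once rather than once per phrase. This is a sketch under assumptions: `build_remover` is a hypothetical name, and `\b` only behaves as expected when each phrase starts and ends with a word character.

```python
import re

def build_remover(phrases):
    """Return a function that strips the given phrases (hypothetical helper)."""
    # Longest first, so 'food and drinks' is tried before 'drinks'
    ordered = sorted((p for p in phrases if p), key=len, reverse=True)
    if not ordered:
        # nothing to remove: only normalise whitespace
        return lambda text: ' '.join(text.split())
    # re.escape keeps characters such as '.' or '+' from acting as regex syntax
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, ordered)) + r')\b')

    def remove(text):
        # replace matches with a space, then collapse leftover whitespace
        return ' '.join(pattern.sub(' ', text).split())

    return remove

removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
remover = build_remover(removed)
print(remover(sentence))
```

The compiled remover can be reused across many sentences, which may help if the phrase list is long.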
removed = ['range', 'drinks', 'food and drinks', 'summer drinks','are']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
words = sentence.split()
resultwords = [word for word in words if word.lower() not in removed]
result = ' '.join(resultwords)
print(result)
Result:
Oranges the main ingredient for a wide of food and