
Remove a list of phrases from a string

I have a list of phrases (n-grams) that need to be removed from a given sentence.

    removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
    sentence = 'Oranges are the main ingredient for a wide range of food and drinks'

I want to get:

    new_sentence = 'Oranges are the main ingredient for a wide of'

I tried the approach from "Remove list of phrases from string", but it doesn't work: 'Oranges' turns into 'Os', and 'drinks' is removed instead of the whole phrase 'food and drinks'.
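
For reference, the failure can be reproduced with plain substring replacement. This is only a guess at what the linked approach boils down to (chained `str.replace` calls); the variable name `naive` is made up for the illustration:

```python
removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'

# Assumed naive approach: delete each phrase as a raw substring, in list order.
naive = sentence
for phrase in removed:
    naive = naive.replace(phrase, '')

# 'range' is cut out of the middle of 'Oranges', and 'drinks' is deleted
# before 'food and drinks' gets a chance to match.
print(repr(naive))
```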

Does anyone know how to solve this? Thank you!

Since you want to match on whole words only, I think the first step is to turn everything into lists of words, and then iterate from the longest phrase to the shortest to find things to remove:

>>> removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
>>> sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
>>> words = sentence.split()
>>> for ngram in sorted([r.split() for r in removed], key=len, reverse=True):
...     for i in range(len(words) - len(ngram)+1):
...         if words[i:i+len(ngram)] == ngram:
...             words = words[:i] + words[i+len(ngram):]
...             break
...
>>> " ".join(words)
'Oranges are the main ingredient for a wide of'

Note that this simple approach has a flaw: because of the break, only the first occurrence of each n-gram is removed. You can't simply keep looping after modifying words either (its length will have changed), so if you want to handle duplicates, you'll need to batch the updates.
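
One possible way to batch the updates is to mark the word positions to drop and build the result at the end. This is a rough sketch, not a drop-in replacement: the function name `remove_ngrams` and the `keep` mask are my own, and unlike the loop above it matches against the original word positions, so words that only become adjacent after a removal are not treated as a new match.

```python
def remove_ngrams(sentence, phrases):
    """Remove every whole-word occurrence of each phrase, longest phrases first."""
    words = sentence.split()
    keep = [True] * len(words)  # batch the deletions: mark now, drop at the end
    # Sort the phrases by their number of words, longest first.
    ngrams = sorted((p.split() for p in phrases), key=len, reverse=True)
    for ngram in ngrams:
        n = len(ngram)
        if n == 0:
            continue  # ignore empty phrases
        i = 0
        while i <= len(words) - n:
            # Only match words that a longer phrase has not already claimed.
            if words[i:i + n] == ngram and all(keep[i:i + n]):
                keep[i:i + n] = [False] * n
                i += n
            else:
                i += 1
    return " ".join(w for w, k in zip(words, keep) if k)


removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
print(remove_ngrams(sentence, removed))
```

Since `words` itself is never modified, the indices stay valid for the whole scan, and repeated occurrences of the same n-gram are all marked.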

Regular expression time!

In [116]: removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
     ...: removed = sorted(removed, key=len, reverse=True)
     ...: sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
     ...: new_sentence = sentence
     ...: import re
     ...: removals = [r'\b' + re.escape(phrase) + r'\b' for phrase in removed]
     ...: for removal in removals:
     ...:     new_sentence = re.sub(removal, '', new_sentence)
     ...: new_sentence = ' '.join(new_sentence.split())
     ...: print(sentence)
     ...: print(new_sentence)
Oranges are the main ingredient for a wide range of food and drinks
Oranges are the main ingredient for a wide of

    import re

    removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
    sentence = 'Oranges are the main ingredient for a wide range of food and drinks'

    # sort the phrases by length, longest first
    removed = sorted(removed, key=len, reverse=True)

    # match whole words only; re.escape guards against regex metacharacters in a phrase
    for r in removed:
        sentence = re.sub(r"\b{}\b".format(re.escape(r)), " ", sentence)

    # collapse runs of spaces into one and trim the ends
    sentence = re.sub(' +', ' ', sentence).strip()

I hope this helps. First, you need to sort the phrases to remove by length, longest first, so that 'food and drinks' is replaced before 'drinks'.
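
The same idea can also be folded into a single pattern, so the sentence is scanned only once. This is a sketch under a few assumptions: the helper name `remove_phrases` is made up, and `\b` only behaves as a whole-word check when every phrase starts and ends with a word character (letter, digit, or underscore).

```python
import re


def remove_phrases(sentence, phrases):
    """Remove whole-word occurrences of the given phrases from sentence."""
    if not phrases:
        return sentence
    # Longest first, so 'food and drinks' is tried before 'drinks'.
    ordered = sorted(phrases, key=len, reverse=True)
    # re.escape keeps characters such as '+' or '.' in a phrase literal.
    alternation = "|".join(re.escape(p) for p in ordered)
    pattern = re.compile(r"\b(?:" + alternation + r")\b")
    stripped = pattern.sub(" ", sentence)
    # split() + join collapses repeated whitespace and trims both ends.
    return " ".join(stripped.split())


removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
print(remove_phrases(sentence, removed))
```

Within an alternation the regex engine tries the branches left to right, which is why the longest-first ordering still matters here.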

Here you go:

removed = ['range', 'drinks', 'food and drinks', 'summer drinks','are']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'

words = sentence.split()
resultwords  = [word for word in words if word.lower() not in removed]
result = ' '.join(resultwords)
print(result)

Results:

Oranges the main ingredient for a wide of food and


 