Removing a list of phrases from a string

Question

I have a list of phrases (n-grams) that need to be removed from a given sentence.

    removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
    sentence = 'Oranges are the main ingredient for a wide range of food and drinks'

I want to get:

    new_sentence = 'Oranges are the main ingredient for a wide of'

I tried removing the list of phrases from the string, but it does not work: 'Oranges' becomes 'Os', and 'drinks' is removed instead of the phrase 'food and drinks'.
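A minimal reproduction of that behavior, assuming the attempt used plain `str.replace` in list order (the original attempt is not shown, so this is only a guess at it):

```python
removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'

naive = sentence
for phrase in removed:
    # str.replace matches substrings anywhere, including inside other words,
    # so 'range' is also cut out of 'Oranges'.
    naive = naive.replace(phrase, '')

# By the time 'food and drinks' is tried, 'drinks' is already gone,
# so the longer phrase no longer matches.
print(naive)
```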

Does anyone know how to solve this? Thanks!

Answer 1

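Since you only want to match whole words, I think the first step is to turn everything into a list of words, then iterate over the phrases from longest to shortest to find what to remove: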

>>> removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
>>> sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
>>> words = sentence.split()
>>> for ngram in sorted([r.split() for r in removed], key=len, reverse=True):
...     for i in range(len(words) - len(ngram)+1):
...         if words[i:i+len(ngram)] == ngram:
...             words = words[:i] + words[i+len(ngram):]
...             break
...
>>> " ".join(words)
'Oranges are the main ingredient for a wide of'

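Note that this simple approach has some limitations. It does not remove multiple copies of the same n-gram, and the inner loop cannot simply continue after `words` is modified (its length will have changed), so if you need to handle duplicates you will have to batch the updates.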
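One way to handle duplicates is to rebuild the word list on each pass instead of slicing it in place. The following is only a sketch of that idea; the function name `remove_phrases` is hypothetical and not part of the original answer:

```python
def remove_phrases(sentence, phrases):
    """Remove every whole-word occurrence of each phrase, longest phrase first."""
    words = sentence.split()
    # Longest n-grams first so 'food and drinks' is tried before 'drinks'.
    ngrams = sorted((p.split() for p in phrases), key=len, reverse=True)
    for ngram in ngrams:
        n = len(ngram)
        if n == 0:
            continue  # skip empty phrases, which would otherwise never advance
        kept = []
        i = 0
        while i < len(words):
            if words[i:i + n] == ngram:
                i += n  # skip the whole matched n-gram
            else:
                kept.append(words[i])
                i += 1
        words = kept  # the "batched" update: replace the list once per phrase
    return " ".join(words)


removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
print(remove_phrases(sentence, removed))
```

Because each pass builds a fresh list, the index never runs past a shrinking `words`, and repeated occurrences of the same phrase are all dropped.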

Answer 2

Regex time!

In [116]: removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
     ...: removed = sorted(removed, key=len, reverse=True)
     ...: sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
     ...: new_sentence = sentence
     ...: import re
     ...: removals = [r'\b' + re.escape(phrase) + r'\b' for phrase in removed]
     ...: for removal in removals:
     ...:     new_sentence = re.sub(removal, '', new_sentence)
     ...: new_sentence = ' '.join(new_sentence.split())
     ...: print(sentence)
     ...: print(new_sentence)
Oranges are the main ingredient for a wide range of food and drinks
Oranges are the main ingredient for a wide of
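The same idea can also be written as a single compiled pattern, so the sentence is scanned once rather than once per phrase. This is a sketch only; `remove_phrases_regex` is a hypothetical name, and it assumes every phrase starts and ends with a word character so that `\b` behaves as expected:

```python
import re


def remove_phrases_regex(sentence, phrases):
    # Longest first: the alternation tries branches left to right, so
    # 'food and drinks' must come before 'drinks'.
    ordered = sorted(phrases, key=len, reverse=True)
    # re.escape protects against regex metacharacters inside a phrase.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b"
    )
    stripped = pattern.sub("", sentence)
    # Collapse the whitespace left behind by the removals.
    return " ".join(stripped.split())


removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
print(remove_phrases_regex(sentence, removed))
```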

Answer 3

    import re

    removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
    sentence = 'Oranges are the main ingredient for a wide range of food and drinks'

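    # Sort the phrases by length, longest first, so that multi-word
    # phrases are removed before the single words they contain.
    removed = sorted(removed, key=len, reverse=True)

    # Match whole words only by anchoring on word boundaries;
    # re.escape guards against regex metacharacters in a phrase.
    for r in removed:
        sentence = re.sub(r"\b{}\b".format(re.escape(r)), " ", sentence)

    # Replace runs of spaces with a single one and trim the ends.
    sentence = re.sub(' +', ' ', sentence).strip()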

I hope this helps. First, you need to sort the strings to remove by length, so that 'food and drinks' is replaced before 'drinks'.
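To see why the ordering matters, here is a small comparison. It is a sketch; the helper `strip_in_order` is hypothetical and simply applies the removals in whatever order it is given:

```python
import re


def strip_in_order(sentence, phrases):
    # Remove each phrase as a whole word, in the order supplied.
    for phrase in phrases:
        sentence = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", sentence)
    return " ".join(sentence.split())


text = "a wide range of food and drinks"
phrases = ["drinks", "food and drinks"]

# Short phrase first: 'drinks' disappears, so 'food and drinks' never matches.
unsorted_result = strip_in_order(text, phrases)
# Longest first: the full phrase is removed before its last word is touched.
sorted_result = strip_in_order(text, sorted(phrases, key=len, reverse=True))

print(unsorted_result)  # a wide range of food and
print(sorted_result)    # a wide range of
```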

Answer 4

Here you go

removed = ['range', 'drinks', 'food and drinks', 'summer drinks','are']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'

words = sentence.split()
resultwords  = [word for word in words if word.lower() not in removed]
result = ' '.join(resultwords)
print(result)

Result:

Oranges the main ingredient for a wide of food and

4 answers

Answer 1: score 1, posted 2020-06-17 22:34:07
Answer 2: score 0, accepted, posted 2020-06-17 22:37:38
Answer 3: score 0, posted 2020-06-17 22:51:21
Answer 4: score -2, posted 2020-06-17 22:24:24
