简体   繁体   中英

Remove a list of phrase from string

I have a list of phrases (n-grams) that need to be removed from a given sentence.

    removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
    sentence = 'Oranges are the main ingredient for a wide range of food and drinks'

I want to get:

    new_sentence = 'Oranges are the main ingredient for a wide of'

I tried Remove list of phrases from string but it doesn't work ('Oranges' turns into 'Os', 'drinks' is removed instead of a phrase 'food and drinks')

Does anyone know how to solve it? Thank you!

Since you want to match on whole words only, I think the first step is to turn everything into lists of words, and then iterate from longest to shortest phrase in order to find things to remove:

>>> removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
>>> sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
>>> words = sentence.split()
>>> for ngram in sorted([r.split() for r in removed], key=len, reverse=True):
...     for i in range(len(words) - len(ngram)+1):
...         if words[i:i+len(ngram)] == ngram:
...             words = words[:i] + words[i+len(ngram):]
...             break
...
>>> " ".join(words)
'Oranges are the main ingredient for a wide of'

Note that there are some flaws with this simple approach -- multiple copies of the same n-gram won't be removed, but you can't continue with that loop after modifying words either (the length will be different), so if you want to handle duplicates, you'll need to batch the updates.

Regular expression time!

In [116]: removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
     ...: removed = sorted(removed, key=len, reverse=True)
     ...: sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
     ...: new_sentence = sentence
     ...: import re
     ...: removals = [r'\b' + phrase + r'\b' for phrase in removed]
     ...: for removal in removals:
     ...:     new_sentence = re.sub(removal, '', new_sentence)
     ...: new_sentence = ' '.join(new_sentence.split())
     ...: print(sentence)
     ...: print(new_sentence)
Oranges are the main ingredient for a wide range of food and drinks
Oranges are the main ingredient for a wide of
    import re

    removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
    sentence = 'Oranges are the main ingredient for a wide range of food and drinks'

    # sort the removed tokens according to their length,
    removed = sorted(removed, key=len, reverse=True)

    # using word boundaries
    for r in removed:
        sentence = re.sub(r"\b{}\b".format(r), " ", sentence)

    # replace multiple whitspaces with a single one   
    sentence = re.sub(' +',' ',sentence)

I hope this would help: first, you need to sort the removed strings according to their length, in this way 'food and drinks' will be replaced before 'drinks'

Here you go

removed = ['range', 'drinks', 'food and drinks', 'summer drinks','are']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'

words = sentence.split()
resultwords  = [word for word in words if word.lower() not in removed]
result = ' '.join(resultwords)
print(result)

Results:

Oranges the main ingredient for a wide of food and

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM