Remove words from a string that are present in a list

I am using the following Python program to remove stop words from my texts.

import re
from sklearn.feature_extraction import text

mylist= [['an_undergraduate'], ['state_of_the_art', 'terminology']]
######Remove stops
stops = list(text.ENGLISH_STOP_WORDS)
pattern = re.compile(r'|'.join([r'(\_|\b){}\b'.format(x) for x in stops]))
for k in mylist:
    for idx, item in enumerate(k):
        if item not in stops:
            item = pattern.sub('', item).strip()
            k[idx] = item

I want the output as

mylist= [['undergraduate'], ['state_art', 'terminology']]

However, the pattern above does not match the stop words properly. How can I fix this?

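If you check the source code of sklearn.feature_extraction.text.ENGLISH_STOP_WORDS, you will see it is a frozenset, so there is no need to convert it to a list; membership tests on a frozenset are also faster than on a list. Your pattern misses most matches because the underscore counts as a word character in Python's re module, so \b never matches between a stop word and an adjacent underscore (for example between "an" and "_" in "an_undergraduate"). Instead of a regex, a nested list comprehension is simpler and will likely perform better: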

>>> from sklearn.feature_extraction import text
>>> mylist= [['an_undergraduate'], ['state_of_the_art', 'terminology']]

>>> [['_'.join([w for w in i.split('_') if w not in text.ENGLISH_STOP_WORDS]) for i in e] for e in mylist]
[['undergraduate'], ['state_art', 'terminology']]

Here I first split each string on underscores, keep only the pieces that are not in ENGLISH_STOP_WORDS, and join the remaining pieces back together with underscores.
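
The same split, filter and join steps can also be written as a small helper, which may be easier to read and reuse than the one-liner. This is only a sketch: the name strip_stop_words and its sep parameter are made up for illustration, and the three-word stop_words set is a stand-in so the snippet runs without scikit-learn. In the original code you would pass text.ENGLISH_STOP_WORDS instead.

```python
def strip_stop_words(token, stop_words, sep="_"):
    """Return token with every sep-separated part that is a stop word removed."""
    return sep.join(part for part in token.split(sep) if part not in stop_words)


# Stand-in stop list (an assumption for this sketch); replace with
# sklearn.feature_extraction.text.ENGLISH_STOP_WORDS in real use.
stop_words = frozenset({"an", "of", "the"})

mylist = [["an_undergraduate"], ["state_of_the_art", "terminology"]]

# Build a new nested list rather than mutating mylist while iterating over it.
cleaned = [[strip_stop_words(item, stop_words) for item in group] for group in mylist]
print(cleaned)  # [['undergraduate'], ['state_art', 'terminology']]
```

Note that a string made up only of stop words, such as "of_the", comes back as an empty string, so you may want to filter out empty results afterwards.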
