I am using the following Python program to remove stop words from text.
import re
from sklearn.feature_extraction import text
mylist= [['an_undergraduate'], ['state_of_the_art', 'terminology']]
######Remove stops
stops = list(text.ENGLISH_STOP_WORDS)
pattern = re.compile(r'|'.join([r'(\_|\b){}\b'.format(x) for x in stops]))
for k in mylist:
    for idx, item in enumerate(k):
        if item not in stops:
            item = pattern.sub('', item).strip()
            k[idx] = item
I want the output as
mylist= [['undergraduate'], ['state_art', 'terminology']]
However, the pattern I am using does not match the stop words properly. How can I fix this?
If you check the source code of sklearn.feature_extraction.text.ENGLISH_STOP_WORDS, you will see that it is a frozenset. There is no need to cast it to a list: membership tests against a frozenset take constant time on average, whereas a list requires a linear scan. Instead of using a regex, the following nested list comprehension is simpler and should be considerably faster.
>>> from sklearn.feature_extraction import text
>>> mylist= [['an_undergraduate'], ['state_of_the_art', 'terminology']]
>>> [['_'.join([w for w in i.split('_') if w not in text.ENGLISH_STOP_WORDS]) for i in e] for e in mylist]
[['undergraduate'], ['state_art', 'terminology']]
Here I first split each string on underscores, check whether each piece is present in ENGLISH_STOP_WORDS, and rejoin only the pieces that are not stop words into the new string.
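If you need this in more than one place, the same split, filter, and rejoin steps can be wrapped in small helper functions. This is only a sketch: the names strip_stop_words, strip_nested, and demo_stops are made up for illustration, and demo_stops is a tiny hand-picked subset so the snippet runs without scikit-learn. In real use you would pass text.ENGLISH_STOP_WORDS as the stop_words argument.

```python
def strip_stop_words(token, stop_words, sep="_"):
    """Drop every sep-separated piece of token that is a stop word."""
    kept = [piece for piece in token.split(sep) if piece not in stop_words]
    return sep.join(kept)


def strip_nested(groups, stop_words, sep="_"):
    """Apply strip_stop_words to each string in a list of lists."""
    return [[strip_stop_words(tok, stop_words, sep) for tok in group]
            for group in groups]


# Hypothetical subset for illustration; pass text.ENGLISH_STOP_WORDS in practice.
demo_stops = frozenset({"an", "of", "the"})

mylist = [['an_undergraduate'], ['state_of_the_art', 'terminology']]
result = strip_nested(mylist, demo_stops)
print(result)  # [['undergraduate'], ['state_art', 'terminology']]
```

Note that a string made up entirely of stop words, such as 'of_the', comes back as an empty string, so you may want to filter those out afterwards.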