How can I remove English stop words using NLTK corpus from the Pandas dataframe text column?
I am looking for a way to remove English stop words from a Pandas dataframe text column using the NLTK corpus. Can this be done with the dataframe apply method? If so, please share how.
stop_words = set(stopwords.words('english'))
data['text'] = data['text'].apply(lambda text: " ".join(w) for w in text.lower().split() if w not in stop_words)
Thanks, I would appreciate it if someone could answer.
You could tokenize your text column (or simply split it into a list of words) and then remove the stop words using the map or apply method.
For example:
import pandas as pd

data = pd.DataFrame({'text': ['a sentence can have stop words', 'stop words are common words like if, I, you, a, etc...']})
data
text
0 a sentence can have stop words
1 stop words are common words like if, I, you, a...
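from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
# A set makes each membership test fast compared with scanning a list
stop_words = set(stopwords.words('english'))

def clean(x):
    # Lowercase, tokenize on word characters, then keep only non-stop words
    doc = tokenizer.tokenize(x.lower())
    return [w for w in doc if w not in stop_words]

data.text.map(clean)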
0 [sentence, stop, words]
1 [stop, words, common, words, like, etc]
Name: text, dtype: object
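The one-liner in the question can also be made to work with apply once the whole generator expression sits inside the join() call. The following is a minimal sketch of that idea, not a definitive implementation: the helper name remove_stop_words and the clean_text column are hypothetical, and the short hard-coded stop word set is an assumption standing in for set(stopwords.words('english')) so the snippet runs without the NLTK corpus download.

```python
import pandas as pd

# Assumption: a tiny stand-in for set(stopwords.words('english')),
# used here so the sketch runs without downloading the NLTK corpus.
stop_words = {"a", "can", "have", "are", "if", "i", "you"}

def remove_stop_words(text, stop_words=stop_words):
    """Lowercase the text, split on whitespace, and drop stop words."""
    # The generator expression is passed to join() as a whole,
    # so the result is a single cleaned string per row.
    return " ".join(w for w in text.lower().split() if w not in stop_words)

data = pd.DataFrame({"text": [
    "a sentence can have stop words",
    "stop words are common words like if, I, you, a, etc...",
]})

# apply calls the helper once for each cell of the text column
data["clean_text"] = data["text"].apply(remove_stop_words)
print(data["clean_text"].tolist())
```

Note that a plain split() leaves punctuation attached, so a token such as "if," does not match the stop word "if" and stays in the second row. That is one reason to prefer a tokenizer, as in the answer above, when the text contains punctuation.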