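How can I remove English stop words using the NLTK corpus from a Pandas dataframe text column?
I am looking for a way to remove English stop words from a Pandas dataframe text column using the NLTK corpus. Can this be done with the dataframe apply method, and if so, could you please show how?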
stop_words = set(stopwords.words('english'))
data['text'] = data['text'].apply(lambda text: " ".join(w) for w in text.lower().split() if w not in stop_words)
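Thanks in advance to anyone who can answer.
You can tokenize the text column (or simply split it into a list of words) and then remove the stop words with the map or apply method.
For example: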
import pandas as pd
data = pd.DataFrame({'text': ['a sentence can have stop words', 'stop words are common words like if, I, you, a, etc...']})
data
text
0 a sentence can have stop words
1 stop words are common words like if, I, you, a...
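from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')  # raw string; matches runs of word characters, so punctuation is dropped
stop_words = set(stopwords.words('english'))  # a set makes membership tests fast

def clean(x):
    # lowercase and tokenize, then keep only the tokens that are NOT stop words
    doc = tokenizer.tokenize(x.lower())
    return [w for w in doc if w not in stop_words]

data.text.map(clean)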
0 [sentence, stop, words]
1 [stop, words, common, words, like, etc]
Name: text, dtype: object
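As for the apply part of the question: the attempt in the question most likely fails because the `for ... if` clause sits outside the `" ".join(...)` call. Below is a minimal sketch of the same idea with the generator moved inside `join`, so the column stays a string rather than a list of tokens. The helper name `remove_stop_words` and the small hand-written `stop_words` set are assumptions made so the snippet runs without downloading NLTK data; in practice you would use `set(stopwords.words('english'))` as above.

```python
import pandas as pd

# Stand-in stop-word set (an assumption for illustration only).
# With NLTK data installed, use: stop_words = set(stopwords.words('english'))
stop_words = {"a", "can", "have", "are", "if", "i", "you"}

def remove_stop_words(text):
    # Lowercase, split on whitespace, drop stop words, and re-join into one string.
    return " ".join(w for w in text.lower().split() if w not in stop_words)

data = pd.DataFrame({'text': ['a sentence can have stop words',
                              'stop words are common words']})

# apply calls the function once per row of the text column
data['text'] = data['text'].apply(remove_stop_words)
print(data['text'].tolist())
# ['sentence stop words', 'stop words common words']
```

Note that splitting on whitespace leaves punctuation attached to words (for example "if," would not match "if"), which is why the answer above uses a tokenizer instead.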