How do I remove English stop words from a Pandas DataFrame text column using the NLTK corpus?

Question

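I'm looking for a way to remove English stop words from a text column of a Pandas DataFrame using the NLTK corpus. Can this be done with the DataFrame apply method? If so, could you please share how?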

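from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
# Keep only the words that are not stop words, then rejoin them into one string
data['text'] = data['text'].apply(lambda text: " ".join(w for w in text.lower().split() if w not in stop_words))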

Thanks in advance to anyone who can answer.

Answer 1

You can tokenize the text column (or simply split it into a list of words), then remove the stop words with the map or apply method.

For example:

import pandas as pd

data = pd.DataFrame({'text': ['a sentence can have stop words', 'stop words are common words like if, I, you, a, etc...']})
data
                                                text
0                     a sentence can have stop words
1  stop words are common words like if, I, you, a...

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

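# Match runs of word characters, which also drops punctuation
tokenizer = RegexpTokenizer(r'\w+')
# A set makes the membership test below fast
stop_words = set(stopwords.words('english'))

def clean(x):
    # Lowercase, tokenize, and keep only the tokens that are not stop words
    doc = tokenizer.tokenize(x.lower())
    return [w for w in doc if w not in stop_words]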

data.text.map(clean)
0                    [sentence, stop, words]
1    [stop, words, common, words, like, etc]
Name: text, dtype: object
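
The result above is a list of tokens per row. If you would rather keep a plain string in the column, as in the question, one option is to join the surviving tokens back together. The following is only a minimal sketch: remove_stop_words and SAMPLE_STOP_WORDS are hypothetical names, and the tiny stop word set is a stand-in for the NLTK list so that the snippet runs without downloading the corpus.

```python
import re

# Hypothetical stand-in for set(stopwords.words('english')).
# The real NLTK list is much longer; this one only covers the sample text.
SAMPLE_STOP_WORDS = {"a", "can", "have", "are", "if", "i", "you"}


def remove_stop_words(text, stop_words=SAMPLE_STOP_WORDS):
    """Return the text lowercased, with punctuation and stop words removed."""
    # \w+ matches runs of word characters, so punctuation is dropped here
    tokens = re.findall(r"\w+", text.lower())
    # Join the remaining tokens so the result is a string, not a list
    return " ".join(w for w in tokens if w not in stop_words)


print(remove_stop_words("a sentence can have stop words"))
# prints: sentence stop words
```

With the DataFrame and the NLTK stop_words from the example above, this could be applied as

    data['text'] = data['text'].apply(remove_stop_words, stop_words=stop_words)

since Series.apply forwards extra keyword arguments to the function.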

Solution 1
0 2019-06-12 11:43:41
