NLTK: Removing stop words from a list of lists

Question

I am trying to remove stop words and tried the following:

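from nltk.tokenize import RegexpTokenizer

# data is my DataFrame; split each row of text on runs of word characters
tokenizer = RegexpTokenizer(r'\w+')
tokenized = data['data_column'].apply(tokenizer.tokenize)
tokenized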

After tokenization I get the output below:

0    [ANOTHER,SAMPLE,AS,OUTPUT,MSG...
1    [A,SAMPLE,TEXT,FOR,ILLUSTRATION...
Name: data_column, dtype: object

I tried to remove the stop words with the following:

stop_words = set(stopwords.words('english'))
filtered_sentence = [w for w in tokenized if not w in stop_words]
filtered_sentence = []
for w in tokenized:
    if w not in stop_words:
        filtered_sentence.append(w)

I get this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-272-d4a699384ffc> in <module>()
      2 stop_words = set(stopwords.words('english'))
      3 
----> 4 filtered_sentence = [w for w in tokenized if not w in stop_words]
      5 
      6 filtered_sentence = []

TypeError: unhashable type: 'list'

Answer 1

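You need .apply() to filter each list inside the Series of lists. Iterating over the Series directly yields whole token lists, and a list cannot be tested for membership in a set, which is what raises the TypeError. Because the stop word corpus contains only lowercase words, you also need to call .lower() on each token before the lookup.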

stop_words = set(stopwords.words('english'))
filtered_sentence = tokenized.apply(lambda x : [w for w in x if w.lower() not in stop_words])
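
To see where the original error comes from, here is a minimal sketch that reproduces it without pandas or the NLTK corpus. The small stop_words set and the rows list are hypothetical stand-ins for set(stopwords.words('english')) and the tokenized column; the helper name membership_error is also just for illustration.

```python
# Hypothetical stand-in for set(stopwords.words('english')).
stop_words = {"a", "as", "for", "another"}

# Each row of the tokenized column is itself a list of tokens.
rows = [
    ["ANOTHER", "SAMPLE", "AS", "OUTPUT", "MSG"],
    ["A", "SAMPLE", "TEXT", "FOR", "ILLUSTRATION"],
]


def membership_error(row):
    """Return the message raised when a whole list is tested against the set."""
    try:
        # A set lookup has to hash its argument, and lists are not hashable.
        row in stop_words
    except TypeError as exc:
        return str(exc)
    return None


error_message = membership_error(rows[0])
print(error_message)
```

The lookup only works on the individual strings inside each row, which is why the filter has to run once per list.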

Sample run

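import pandas as pd
from nltk.corpus import stopwords

# The English stop word list is all lowercase
stop = set(stopwords.words('english'))

df = pd.DataFrame({'words': [['A','SAMPLE','AS','OUTPUT','MSG']]})
# Filter each token list, lowercasing tokens only for the lookup
df['words'].apply(lambda x : [i for i in x if i.lower() not in stop])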

0    [SAMPLE, OUTPUT, MSG]
Name: words, dtype: object
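
If the tokens are held in a plain Python list of lists rather than a pandas Series, the same filter can be written as a nested list comprehension. This is only a sketch: STOP_WORDS is a small hypothetical set used in place of the NLTK corpus so the example runs without any downloads, and remove_stop_words is an illustrative helper name, not part of NLTK.

```python
# Hypothetical stand-in for set(stopwords.words('english')).
STOP_WORDS = {"a", "as", "for", "another"}


def remove_stop_words(token_lists, stop_words=STOP_WORDS):
    """Drop stop words from every inner list, comparing case-insensitively."""
    return [
        [tok for tok in tokens if tok.lower() not in stop_words]
        for tokens in token_lists
    ]


rows = [
    ["ANOTHER", "SAMPLE", "AS", "OUTPUT", "MSG"],
    ["A", "SAMPLE", "TEXT", "FOR", "ILLUSTRATION"],
]

filtered = remove_stop_words(rows)
print(filtered)
```

The tokens keep their original case in the result; .lower() is applied only for the membership test.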

Answer 1 (4, accepted, 2017-10-24 15:35:17)
