NLTK: Eliminating stop words from a list of lists
I am trying to remove stop words and have tried the following:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
tokenized = data['data_column'].apply(tokenizer.tokenize)
tokenized
After tokenization, the output is as follows:
0 [ANOTHER,SAMPLE,AS,OUTPUT,MSG...
1 [A,SAMPLE,TEXT,FOR,ILLUSTRATION...
Name: data_column, dtype: object
I then tried to remove the stop words with the following:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

filtered_sentence = [w for w in tokenized if not w in stop_words]

filtered_sentence = []
for w in tokenized:
    if w not in stop_words:
        filtered_sentence.append(w)
I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-272-d4a699384ffc> in <module>()
2 stop_words = set(stopwords.words('english'))
3
----> 4 filtered_sentence = [w for w in tokenized if not w in stop_words]
5
6 filtered_sentence = []
TypeError: unhashable type: 'list'
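Iterating over tokenized yields each row's whole list, and testing a list for membership in a set requires hashing it, which raises the TypeError. You need .apply() to filter each list in the Series of lists. Because the stop word corpus contains lowercase words, you also need to call .lower() on each token before the lookup: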
stop_words = set(stopwords.words('english'))
filtered_sentence = tokenized.apply(lambda x : [w for w in x if w.lower() not in stop_words])
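To see why the original loop fails, it may help to reproduce it without pandas or NLTK. The sketch below is only an illustration: `rows` is a plain list of lists standing in for the tokenized Series, and `stop_words` is a small hand-written set assumed here in place of `stopwords.words('english')`, which is much longer.

```python
# Hypothetical stand-in for set(stopwords.words('english')).
stop_words = {"a", "as", "for", "another"}

# Each row is a list of tokens, like each element of the tokenized Series.
rows = [
    ["ANOTHER", "SAMPLE", "AS", "OUTPUT", "MSG"],
    ["A", "SAMPLE", "TEXT", "FOR", "ILLUSTRATION"],
]

# Iterating over the rows yields whole lists. A set membership test hashes
# its operand, and lists are not hashable, so this raises TypeError.
try:
    [row for row in rows if row not in stop_words]
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# The fix: filter inside each row, lowercasing only for the comparison
# so the surviving tokens keep their original casing.
filtered = [[w for w in row if w.lower() not in stop_words] for row in rows]

print(error_message)
print(filtered)
```

The inner comprehension does the same job as the lambda passed to .apply(): it receives one row at a time and compares individual strings, not lists, against the set.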
Sample run:
import pandas as pd
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
df = pd.DataFrame({'words': [['A', 'SAMPLE', 'AS', 'OUTPUT', 'MSG']]})
df['words'].apply(lambda x: [i for i in x if i.lower() not in stop])
0 [SAMPLE, OUTPUT, MSG]
Name: words, dtype: object
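If the filter is reused, a named function can be easier to read than a lambda. The following is a minimal sketch under a few assumptions: `remove_stop_words` and the `filtered` column are hypothetical names chosen for illustration, and `stop` is again a tiny hand-written set rather than the NLTK corpus, so the snippet runs without downloading anything.

```python
import pandas as pd

# Hypothetical stand-in for set(stopwords.words('english')).
stop = {"a", "as", "for"}


def remove_stop_words(tokens, stop_words=stop):
    """Return the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


df = pd.DataFrame({
    "words": [
        ["A", "SAMPLE", "AS", "OUTPUT", "MSG"],
        ["A", "SAMPLE", "TEXT", "FOR", "ILLUSTRATION"],
    ]
})

# Series.apply calls the helper once per row, so each call receives one list.
df["filtered"] = df["words"].apply(remove_stop_words)

print(df["filtered"].tolist())
```

Writing the result to a new column keeps the original tokens available for comparison; assigning back to `df['words']` would work the same way if they are no longer needed.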