Removing stop words using NLTK in Python

Question

I am using NLTK to remove stop words from the elements of a list. Here is my code snippet:

dict1 = {}
for ctr, row in enumerate(cur.fetchall()):
    list1 = [row[0], row[1], row[2], row[3], row[4]]
    dict1[row[0]] = list1
    print ctr+1,"\n",dict1[row[0]][2]
    list2 = [w for w in dict1[row[0]][3] if not w in stopwords.words('english')]
    print list2

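The problem is that this not only removes stop words but also strips characters out of other words. For example, the 'i' in the word 'orientation' is removed, along with other letters that happen to be stop words, and list2 ends up holding characters instead of words, i.e. ['O', 'r', 'e', 'n', 'n', ' ', 'f', ' ', '3', ' ', 'r', 'e', 'r', 'e', ' ', 'p', 'n', '\\n', '\\n', '\\n', 'O', 'r', 'e', 'n', 'n', ' ', 'f', ' ', 'n', ' ', 'r', 'e', 'r', 'e', ' ', 'r', 'p', 'l', ... whereas I want it stored as ['Orientation', ...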

Answer 1

First, make sure that list1 is a list of words, not an array of characters. Here is a code snippet you can adapt.

from nltk import word_tokenize
from nltk.corpus import stopwords

english_stopwords = stopwords.words('english')    # get english stop words

# test document
document = '''A moody child and wildly wise
Pursued the game with joyful eyes
'''

# first tokenize your document to a list of words
words = word_tokenize(document)
print(words)

# then remove all stop words (compare in lowercase, since the stop word list is lowercase)
content = [w for w in words if w.lower() not in english_stopwords]
print(content)

The output will be:

['A', 'moody', 'child', 'and', 'wildly', 'wise', 'Pursued', 'the', 'game', 'with', 'joyful', 'eyes']
['moody', 'child', 'wildly', 'wise', 'Pursued', 'game', 'joyful', 'eyes']
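
One refinement that may be worth considering: `stopwords.words('english')` returns a list, so every membership test scans it from the start. A minimal sketch, assuming the tokens are already available, builds a set once and filters against that. The helper name `filter_tokens` and the tiny `demo_stop_words` list are illustrative stand-ins and not part of NLTK; in real use you would pass `stopwords.words('english')` instead.

```python
def filter_tokens(tokens, stop_words):
    """Return the tokens whose lowercase form is not a stop word."""
    stop_set = {s.lower() for s in stop_words}  # built once, O(1) lookups
    return [t for t in tokens if t.lower() not in stop_set]


# Hypothetical, hand-picked stop words; a stand-in for stopwords.words('english').
demo_stop_words = ['a', 'and', 'the', 'with']

tokens = ['A', 'moody', 'child', 'and', 'wildly', 'wise']
filtered = filter_tokens(tokens, demo_stop_words)
print(filtered)
```

The behavior matches the comprehension above; only the lookup structure changes, which tends to matter when filtering many rows.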

Answer 2

First, the way you build list1 looks a bit unusual to me. I think there is a more Pythonic solution:

list1 = row[:5]

Next, is there a reason you go through dict1[row[0]][3] instead of accessing row[3] directly?
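
A small sketch of the slicing suggestion, assuming the cursor returns each row as a tuple (as many database drivers do; the question does not say which driver is used). The `row` value here is made up for illustration. Note that slicing a tuple yields a tuple, so wrap it in `list()` if a real list is needed.

```python
# Hypothetical row, shaped like one tuple from cur.fetchall().
row = (1, 'title', 'author', 'Orientation of the new staff', 'extra', 'unused')

explicit = [row[0], row[1], row[2], row[3], row[4]]
sliced = list(row[:5])  # row[:5] on its own would be a tuple here

print(explicit == sliced)
print(type(row[:5]).__name__)
```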

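Finally, assuming the row is a list of strings, building list2 from row[3] iterates over each character, not each word. That is probably why you are filtering out 'i' and 'a' (and a few other characters): those single letters are themselves in the stop word list.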

The correct comprehension would be:

list2 = [w for w in row[3].split(' ') if w not in stopwords.words('english')]

You have to split your string somehow, probably on whitespace. That turns something like:

'Hello, this is row3'

into

['Hello,', 'this', 'is', 'row3']

Iterating over that gives you whole words rather than individual characters.
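
To make the difference concrete, here is a minimal sketch that reproduces the symptom from the question and then the fix. The `text` value and the small `demo_stop_words` set are assumptions chosen to mirror the question's output; they stand in for row[3] and for the full NLTK list.

```python
# Hypothetical stand-ins for row[3] and stopwords.words('english').
demo_stop_words = {'i', 'a', 't', 'o', 'of', 'the'}
text = 'Orientation of the staff'

# Iterating over a string yields characters, so single-letter
# stop words such as 'i' and 'a' are removed from inside words.
by_char = [c for c in text if c not in demo_stop_words]

# Splitting first yields whole words, which is what was intended.
by_word = [w for w in text.split() if w not in demo_stop_words]

print(by_char[:5])
print(by_word)
```

Calling `split()` with no argument splits on any run of whitespace, including the newlines visible in the question's output, whereas `split(' ')` splits on single spaces only. Neither strips punctuation, so 'Hello,' keeps its comma; a tokenizer such as `word_tokenize` handles that case.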

Answer 1
3 2016-07-08 20:14:09

Answer 2
0 2016-07-08 20:28:41
