Counting tokens after tokenization, stop-word removal, and stemming

Question

I have the following function:

def preprocessText (data):
    stemmer = nltk.stem.porter.PorterStemmer()
    preprocessed = []
    for each in data:
        tokens = nltk.word_tokenize(each.lower().translate(string.punctuation))
        filtered = [word for word in tokens if word not in nltk.corpus.stopwords.words('english')]
        preprocessed.append([stemmer.stem(item) for item in filtered])
    print(Counter(tokens).most_common(10))
    return (np.array(preprocessed))

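This is supposed to remove punctuation, tokenize, remove stop words, and stem with the Porter stemmer. However, it does not work correctly. For example, when I run the following code: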

s = ["The cow and of.", "and of dog the."]
print (Counter(preprocessText(s)))

It produces the following output:

[('and', 1), ('.', 1), ('dog', 1), ('the', 1), ('of', 1)]

Neither the punctuation nor the stop words are removed.

Answer 1

Your call to translate does not remove the punctuation. Here is some working code. I made a few changes, the most important of which is:

Code:

xlate = {ord(x): y for x, y in
         zip(string.punctuation, ' ' * len(string.punctuation))}
tokens = nltk.word_tokenize(each.lower().translate(xlate))

Test code:

from collections import Counter
import nltk
import numpy as np
import string

stopwords = set(nltk.corpus.stopwords.words('english'))
try:
    # python 2
    xlate = string.maketrans(
        string.punctuation, ' ' * len(string.punctuation))
except AttributeError:
    xlate = {ord(x): y for x, y in
             zip(string.punctuation, ' ' * len(string.punctuation))}

def preprocessText(data):
    stemmer = nltk.stem.porter.PorterStemmer()
    preprocessed = []
    for each in data:
        tokens = nltk.word_tokenize(each.lower().translate(xlate))
        filtered = [word for word in tokens if word not in stopwords]
        preprocessed.append([stemmer.stem(item) for item in filtered])
    return np.array(preprocessed)

s = ["The cow and of.", "and of dog the."]
print(Counter(sum([list(x) for x in preprocessText(s)], [])))

Results:

Counter({'dog': 1, 'cow': 1})
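To see why the original call left the period in place, here is a minimal sketch that uses only the standard library and assumes Python 3. The variable names `unchanged` and `cleaned` are illustrative and not part of the answer above.

```python
import string

text = "The cow and of.".lower()

# Passing the punctuation string itself as the table: translate() indexes the
# table by each character's ordinal. string.punctuation has 32 characters, and
# every character in `text` has an ordinal of 32 or more, so each lookup fails
# and the character is kept unchanged.
unchanged = text.translate(string.punctuation)

# A dict keyed by ordinals maps each punctuation mark to a space instead.
xlate = {ord(ch): " " for ch in string.punctuation}
cleaned = text.translate(xlate)

print(repr(unchanged))
print(repr(cleaned))
```

The first call returns the text as it was, which matches the symptom in the question; the second replaces the trailing period with a space.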

Answer 2

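The problem is that you are misusing translate. To use it correctly, you need to build a mapping table that (as the help string will tell you) maps "Unicode ordinals to Unicode ordinals, strings, or None". For example, like this: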

>>> mapping = dict((ord(x), None) for x in string.punctuation)  # `None` means "delete"
>>> print("This.and.that".translate(mapping))
Thisandthat
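In Python 3, an equivalent deletion table can also be built with `str.maketrans`, whose third argument lists the characters to delete. A minimal sketch, assuming Python 3 (the name `delete_punct` is illustrative):

```python
import string

# The third argument of str.maketrans lists characters to map to None,
# which means "delete" when the table is passed to translate().
delete_punct = str.maketrans("", "", string.punctuation)

result = "This.and.that".translate(delete_punct)
print(result)
```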

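However, if you do this to word tokens, you will be replacing punctuation tokens with empty strings. You could add a step to get rid of them, but I suggest that you simply select what you want: alphanumeric words.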

tokens = [t for t in nltk.word_tokenize(each.lower()) if t.isalnum()]

That is all you need to change in your code.
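The alphanumeric filter can be tried end to end with the rough sketch below. So that it runs without NLTK or its data files, it swaps in two hypothetical stand-ins: `simple_tokenize`, a regex tokenizer used in place of `nltk.word_tokenize`, and `STOPWORDS`, a tiny illustrative set rather than NLTK's English list. Stemming is left out to keep the example short.

```python
import re
from collections import Counter

# Hypothetical stand-ins so the sketch runs without NLTK or its data files.
STOPWORDS = {"the", "and", "of"}  # tiny illustrative subset, not NLTK's list


def simple_tokenize(text):
    # Split into runs of word characters and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def preprocess(sentences):
    result = []
    for sentence in sentences:
        tokens = simple_tokenize(sentence.lower())
        # Keep alphanumeric tokens only, then drop stop words.
        words = [t for t in tokens if t.isalnum() and t not in STOPWORDS]
        result.append(words)
    return result


s = ["The cow and of.", "and of dog the."]
# Flatten the per-sentence lists before counting, so the Counter sees words
# from every sentence rather than only the last one.
counts = Counter(word for words in preprocess(s) for word in words)
print(counts)
```

Counting over the flattened tokens avoids the situation in the question, where the `print` inside the function only reported the tokens of the final sentence.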
