word_tokenize: same code, same dataset, different results. Why?

Question

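Last month I tokenized some text and built a bag of words to see which words occur most often. Today I ran the same code on the same dataset again. It still runs, but the results are different, and today's results are clearly wrong because the word frequencies have dropped significantly.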

Here is my code:

from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.stem import WordNetLemmatizer
import nltk
from collections import Counter

sent = nltk.word_tokenize(str(df.description))
lower_token = [t.lower() for t in sent]
alpha = [t for t in lower_token if t.isalpha()]
stop_word =  [t for t in alpha if t not in ENGLISH_STOP_WORDS]
k = WordNetLemmatizer()
lemma = [k.lemmatize(t) for t in stop_word]
bow = Counter(lemma)
print(bow.most_common(20))

Here is a sample of my dataset.

The dataset comes from Kaggle and is called "Wine Reviews".

Answer 1

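Welcome to StackOverflow.

There are two possible causes of your problem.

1) You may have modified the dataset. I would inspect the dataset to see whether you changed the data itself, because your code works on other examples and its output cannot change from day to day: it has no random elements.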

2) The second possible cause is the way you pass the df.description column in this line:

sent = nltk.word_tokenize(str(df.description))

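You get truncated output. Check the type of df.description: it is a Series object, and calling str() on a Series returns its printed summary rather than the full text of every row.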

I created another example, shown below:

from nltk.tokenize import word_tokenize
import pandas as pd

df = pd.DataFrame({'description' : ['The OP is asking a question and I referred him to the Minimum Verifible Example page which states: When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimal, reproducible example (reprex), a minimal, complete and verifiable example (mcve), or a minimal, workable example (mwe). Regardless of how it\'s communicated to you, it boils down to ensuring your code that reproduces the problem follows the following guidelines:']})


print(df.description)

0    The OP is asking a question and I referred him...
Name: description, dtype: object

As you can see above, the output is truncated; it is not the full text of the description column.
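To make the size of the loss concrete, here is a minimal sketch that compares the string returned by str() with the text actually stored in the column. The sample sentence and the 100-row Series are made up for illustration, and the sketch assumes pandas' default display options.

```python
import pandas as pd

# Hypothetical data: 100 rows holding the same long review text.
long_text = "This wine is ripe and fruity, with a long and smooth finish. " * 5
s = pd.Series([long_text] * 100, name="description")

as_printed = str(s)                  # what str(df.description) returns
full_text = " ".join(s.astype(str))  # every row, in full

# The printed form is far shorter than the real text.
print(len(as_printed), len(full_text))
# Long cells and middle rows are elided with "...".
print("..." in as_printed)
# The "Name: ..." footer is part of the string that would be tokenized.
print("Name: description" in as_printed)
```

Any word count computed from as_printed only sees the few rows pandas chooses to display, which would explain a sharp drop in frequencies.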

My suggestion for your code is to look at this line and find a different approach:

sent = nltk.word_tokenize(str(df.description))

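Note that the approach used in your code also feeds the tokenizer the index numbers (which I understand you filter out with isalpha) and the Name: description, dtype: object footer, so those end up in the data you are processing.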

One option is to use map to process your data row by row. For example:

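import nltk
# Show full cell contents when printing; None replaces -1, which is deprecated since pandas 1.0
pd.set_option('display.max_colwidth', None)
df['tokenized'] = df['description'].map(str).map(nltk.word_tokenize)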

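Continue in the same way for the other operations. A simple approach is to build a preprocessing function that applies all the preprocessing steps you want to use to your dataframe.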
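Such a preprocessing function could look roughly like the sketch below. To keep it runnable without NLTK data downloads or scikit-learn, it uses stand-ins that are assumptions of this sketch, not part of the original code: a regex tokenizer (simple_tokenize), an identity lemmatizer, and a tiny stop-word set (SMALL_STOP_WORDS). The names preprocess and bag_of_words are also hypothetical. In the question's setup you would pass nltk.word_tokenize, WordNetLemmatizer().lemmatize and ENGLISH_STOP_WORDS instead.

```python
import re
from collections import Counter

import pandas as pd


def simple_tokenize(text):
    """Stand-in tokenizer (assumption): words, numbers, single punctuation marks."""
    return re.findall(r"[A-Za-z]+|\d+|[^\w\s]", text)


# Stand-in stop-word list (assumption); ENGLISH_STOP_WORDS is much larger.
SMALL_STOP_WORDS = frozenset({"a", "an", "and", "is", "of", "the", "this", "with"})


def preprocess(text, tokenize=simple_tokenize, lemmatize=lambda t: t,
               stop_words=SMALL_STOP_WORDS):
    """Lowercase, keep alphabetic tokens, drop stop words, then lemmatize."""
    tokens = [t.lower() for t in tokenize(str(text))]
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    return [lemmatize(t) for t in tokens]


def bag_of_words(descriptions, **kwargs):
    """Count tokens over every row, not over the printed form of the Series."""
    bow = Counter()
    for text in descriptions:
        bow.update(preprocess(text, **kwargs))
    return bow


# Hypothetical two-row sample in place of the Wine Reviews data.
df = pd.DataFrame({"description": [
    "This wine is ripe and fruity, with a ripe finish.",
    "A dry wine with 12% alcohol and notes of cherry.",
]})
bow = bag_of_words(df["description"])
print(bow.most_common(3))
```

Because each row is converted and tokenized on its own, the counts no longer depend on how pandas chooses to display the column.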

I hope this helps.

Solution 1
0 Accepted 2020-03-01 21:12:53
