Issues with fitting the vocabulary in scikit-learn?

Question

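I have a directory full of .txt files (documents). First I load the documents, strip some brackets, and remove some quotes, so the documents look like this, for example: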

document1:
is a scientific discipline that explores the construction and study of algorithms that can learn from data Such algorithms operate by building a model

document2:
Machine learning can be considered a subfield of computer science and statistics It has strong ties to artificial intelligence and optimization which deliver methods

So I load the files from the directory like this:

preprocessDocuments =[[' '.join(x) for x in sample[:-1]] for sample in load(directory)]


documents = ''.join( i for i in ''.join(str(v) for v
                                              in preprocessDocuments) if i not in "',()")

Then I try to vectorize document1 and document2 to build a training matrix, like this:

from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(analyzer='word')
X = HashingVectorizer.fit_transform(documents)
X.toarray()

And this is the output:

    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words

Given this, how can I create a vector representation? I thought documents was holding the loaded files, but it seems they cannot be fitted.

Answer 1

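What does documents contain? It should be a list of filenames or a list of strings containing the tokens. Also, you should call fit_transform on the vectorizer object rather than as if it were a static method, i.e. vectorizer.fit_transform(documents).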
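To illustrate why the type of documents matters, here is a small sketch that uses only the standard library. It assumes the default token pattern of scikit-learn's text vectorizers (two or more word characters); tokens_per_item is a hypothetical helper written for this illustration, not a scikit-learn function. In the question, documents is one long joined string. Iterating over a string yields single characters, none of which matches the pattern, which would be consistent with the empty vocabulary error. Newer scikit-learn versions, as far as I know, reject a plain string with an explicit error instead.

```python
import re

# Assumed default token_pattern of scikit-learn's text vectorizers:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokens_per_item(raw_documents):
    """Tokenize each item obtained by iterating over raw_documents."""
    return [TOKEN_PATTERN.findall(item.lower()) for item in raw_documents]


# One joined string, as in the question.
joined = "Machine learning can be considered a subfield"
# A list with one string per document, as the vectorizer expects.
as_list = ["Machine learning can be considered a subfield"]

# Iterating over a str yields single characters: no tokens at all.
print(tokens_per_item(joined)[:3])
# Iterating over a list yields whole documents: real tokens.
print(tokens_per_item(as_list))
```

The practical consequence is the same in either case: pass a list (or another iterable) with one string per document.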

For example, this works for me:

from sklearn.feature_extraction.text import HashingVectorizer
documents=['this is a test', 'another test']
vectorizer = HashingVectorizer(analyzer='word')
X = vectorizer.fit_transform(documents)
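If the goal is to get from a directory of .txt files to such a list, something along these lines might work. This is only a sketch under assumptions: load_documents is a hypothetical replacement for the load function in the question (whose implementation is not shown), the files are assumed to be UTF-8, and the cleanup simply drops the same characters the question strips.

```python
from pathlib import Path


def load_documents(directory):
    """Return one string per .txt file in directory, in sorted filename order."""
    docs = []
    for path in sorted(Path(directory).glob("*.txt")):
        # Assumption: the files are UTF-8 encoded.
        text = path.read_text(encoding="utf-8")
        # Drop the characters the question removes: quotes, commas, parentheses.
        cleaned = "".join(ch for ch in text if ch not in "',()")
        docs.append(cleaned.strip())
    return docs
```

The returned list keeps each document as its own string, so it can be passed directly to vectorizer.fit_transform in place of the hard-coded documents list above.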
