Problems fitting vocabulary in scikit-learn?

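I have a directory full of .txt files (documents). First I load the documents, strip some parentheses and remove some quotes, so the documents look like this, for example: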

document1:
is a scientific discipline that explores the construction and study of algorithms that can learn from data Such algorithms operate by building a model

document2:
Machine learning can be considered a subfield of computer science and statistics It has strong ties to artificial intelligence and optimization which deliver methods

So I load the files from the directory like this:

preprocessDocuments =[[' '.join(x) for x in sample[:-1]] for sample in load(directory)]


documents = ''.join( i for i in ''.join(str(v) for v
                                              in preprocessDocuments) if i not in "',()")

Then I try to vectorize document1 and document2 to create a training matrix, as follows:

from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(analyzer='word')
X = HashingVectorizer.fit_transform(documents)
X.toarray()

The output is then:

    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words

Given this, how can I create a vector representation? I thought documents held the loaded files, but it seems the vectorizer cannot fit them.

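What is the content of documents? It looks like it should be a list of file names or a list of strings containing the tokens. Also, you should call fit_transform on the vectorizer object, not as if it were a static method, i.e. vectorizer.fit_transform(documents).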

For example, this works here:

from sklearn.feature_extraction.text import HashingVectorizer
documents=['this is a test', 'another test']
vectorizer = HashingVectorizer(analyzer='word')
X = vectorizer.fit_transform(documents)
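
The example above works because documents is a list with one string per document. As a rough sketch of how such a list could be built from a directory of .txt files: the helper name load_documents, the file names and the UTF-8 encoding are assumptions for illustration, not part of the question's own load function. It strips the same characters the question removes (quotes, commas, parentheses) but keeps each file as a separate string instead of joining everything into one.

```python
from pathlib import Path
import tempfile


def load_documents(directory):
    """Return one cleaned string per .txt file in `directory` (hypothetical helper)."""
    docs = []
    # Sort the paths so the document order is reproducible.
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8")  # assumed encoding
        # Drop quotes, commas and parentheses, as in the question.
        cleaned = "".join(ch for ch in text if ch not in "',()")
        docs.append(cleaned.strip())
    return docs


# Demo with two throwaway files standing in for the real directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "document1.txt").write_text(
        "Such algorithms operate by building a model", encoding="utf-8")
    Path(tmp, "document2.txt").write_text(
        "Machine learning can be considered a subfield (of computer science)",
        encoding="utf-8")
    documents = load_documents(tmp)

print(len(documents))
```

A list built this way can then be passed to vectorizer.fit_transform(documents), with vectorizer being the HashingVectorizer instance created as in the example above.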
