How to use sklearn TfidfVectorizer on a pandas DataFrame?
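After splitting my data into training and test sets, I want to use sklearn's TfidfVectorizer on a pandas DataFrame.
Here is the code that splits the data: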
from sklearn.model_selection import train_test_split
train = data_df
train_df, test_df = train_test_split(train, test_size=0.2)
I tried to use the TfidfVectorizer function:
start = time.clock()
vect = CountVectorizer(ngram_range=(2,2))
train_df = vect.fit_transform(train_df)
test_df = vect.transform(test_df)
print (time.clock()-start)
But it raised the following error:
ValueError Traceback (most recent call last)
<ipython-input-36-3588531e9fc6> in <module>
3 vect = CountVectorizer(ngram_range=(2,2))
4 #converting traning features into numeric vector
----> 5 train_df = vect.fit_transform(train_df)
6 #converting training labels into numeric vector
7 test_df = vect.transform(test_df)
~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
1218
1219 vocabulary, X = self._count_vocab(raw_documents,
-> 1220 self.fixed_vocabulary_)
1221
1222 if self.binary:
~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
1148 vocabulary = dict(vocabulary)
1149 if not vocabulary:
-> 1150 raise ValueError("empty vocabulary; perhaps the documents only"
1151 " contain stop words")
1152
ValueError: empty vocabulary; perhaps the documents only contain stop words
Am I missing something? Is there a way to solve this problem? Thanks.
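The problem seems to be that your records may consist of single strings. Try converting them to a list, or set the minimum document frequency to 1. See the link given with this answer, which produces the result you want.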
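One likely reading of the "single strings" remark: iterating over a DataFrame yields its column names rather than its rows, so the vectorizer only sees a few one-word "documents" and cannot build any bigrams with ngram_range=(2,2). The sketch below illustrates this under assumptions that are not in the question: the column names "text" and "label" and the sample sentences are hypothetical, and TfidfVectorizer stands in for the CountVectorizer used in the snippet (both accept the same kind of input). Passing the text column instead of the whole DataFrame is one way to avoid the error.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical data: the column names "text" and "label" are assumptions,
# not taken from the original question.
data_df = pd.DataFrame({
    "text": [
        "the cat sat on the mat",
        "the dog chased the cat",
        "birds fly over the house",
        "the dog sat on the porch",
        "cats and dogs play together",
    ],
    "label": [0, 1, 0, 1, 0],
})

train_df, test_df = train_test_split(data_df, test_size=0.2, random_state=0)

# Iterating over a DataFrame yields its column names, not its rows.
columns_seen = list(train_df)

# Passing the whole DataFrame therefore feeds only the column names to the
# vectorizer; single-word names produce no bigrams, so the vocabulary is empty.
reproduced_error = None
try:
    TfidfVectorizer(ngram_range=(2, 2)).fit_transform(train_df)
except ValueError as exc:
    reproduced_error = str(exc)

# Passing the text column gives the vectorizer one document per row.
vect = TfidfVectorizer(ngram_range=(2, 2))
X_train = vect.fit_transform(train_df["text"])  # learn vocabulary on train only
X_test = vect.transform(test_df["text"])        # reuse the same vocabulary

print(columns_seen)
print(reproduced_error)
print(X_train.shape, X_test.shape)
```

Fitting on the training column and only transforming the test column keeps both matrices in the same feature space, which is what the original fit_transform / transform pair was aiming for.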
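Notice: Technical posts on this site are licensed under CC BY-SA 4.0. If you repost this content, please credit this site's URL or the original source. For any questions, contact yoyou2525@163.com.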