
How to use TfidfVectorizer on a pandas DataFrame?

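After splitting my data into training and test sets, I want to apply sklearn's TfidfVectorizer to a pandas DataFrame.

Here is the code that splits the data: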

from sklearn.model_selection import train_test_split

train = data_df
train_df, test_df = train_test_split(train, test_size=0.2)

I then tried to vectorize the data (the snippet below uses CountVectorizer rather than TfidfVectorizer):

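import time
from sklearn.feature_extraction.text import CountVectorizer

start = time.perf_counter()  # time.clock() was removed in Python 3.8
vect = CountVectorizer(ngram_range=(2, 2))
# the whole DataFrame is passed in here, not a column of text
train_df = vect.fit_transform(train_df)
test_df = vect.transform(test_df)

print(time.perf_counter() - start)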

But it fails with this error:

ValueError                                Traceback (most recent call last)
<ipython-input-36-3588531e9fc6> in <module>
      3 vect = CountVectorizer(ngram_range=(2,2))
      4 #converting traning features into numeric vector
----> 5 train_df = vect.fit_transform(train_df)
      6 #converting training labels into numeric vector
      7 test_df = vect.transform(test_df)

~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
   1218 
   1219         vocabulary, X = self._count_vocab(raw_documents,
-> 1220                                           self.fixed_vocabulary_)
   1221 
   1222         if self.binary:

~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
   1148             vocabulary = dict(vocabulary)
   1149             if not vocabulary:
-> 1150                 raise ValueError("empty vocabulary; perhaps the documents only"
   1151                                  " contain stop words")
   1152 

ValueError: empty vocabulary; perhaps the documents only contain stop words

Is there something I'm missing, or is there a way to solve this? Thanks.

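The problem seems to be that your records may consist of single strings. Try converting them to lists, or set the minimum document frequency (min_df) to 1. Take a look at the linked question below, which produces the result you want: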

ValueError: empty vocabulary; perhaps the documents only contain stop words
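
Another likely cause is worth checking: iterating over a pandas DataFrame yields its column names, not its rows. If the whole frame is passed to fit_transform, the vectorizer only sees the column names as documents, and with ngram_range=(2,2) single-word names produce no bigrams at all, which would give exactly this empty-vocabulary error. Below is a minimal sketch of passing the text column instead. The small data_df and the column names "text" and "label" are made up for illustration; substitute the names from your own data.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for data_df; the column names "text" and
# "label" are assumptions, not taken from the original question.
data_df = pd.DataFrame({
    "text": [
        "the quick brown fox",
        "jumps over the lazy dog",
        "the lazy dog sleeps all day",
        "a quick brown dog runs",
        "the fox sleeps at night",
    ],
    "label": [0, 1, 1, 0, 1],
})

# Iterating over a DataFrame yields its column names, so this is what
# a vectorizer would receive as "documents" if given the whole frame.
docs_seen = list(data_df)

train_df, test_df = train_test_split(data_df, test_size=0.2, random_state=0)

# Fit on the training text column only, then reuse the fitted
# vocabulary and IDF weights to transform the test text column.
vect = TfidfVectorizer(ngram_range=(2, 2))
X_train = vect.fit_transform(train_df["text"])
X_test = vect.transform(test_df["text"])

print(docs_seen)
print(X_train.shape, X_test.shape)
```

Keeping the results in new variables (X_train, X_test) instead of overwriting train_df and test_df also leaves the label column available for training a model afterwards.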

