How do I use TfidfVectorizer on a pandas DataFrame?

Question

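After splitting my data into training and test sets, I want to use sklearn's TfidfVectorizer on a pandas DataFrame.

Here is the code that splits the data: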

train = data_df
train_df, test_df = train_test_split(train, test_size=0.2)

I tried using the TfidfVectorizer function:

start = time.clock()
vect = CountVectorizer(ngram_range=(2,2))
train_df = vect.fit_transform(train_df)
test_df = vect.transform(test_df)

print (time.clock()-start)

But it raises this error:

ValueError                                Traceback (most recent call last)
<ipython-input-36-3588531e9fc6> in <module>
      3 vect = CountVectorizer(ngram_range=(2,2))
      4 #converting traning features into numeric vector
----> 5 train_df = vect.fit_transform(train_df)
      6 #converting training labels into numeric vector
      7 test_df = vect.transform(test_df)

~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
   1218 
   1219         vocabulary, X = self._count_vocab(raw_documents,
-> 1220                                           self.fixed_vocabulary_)
   1221 
   1222         if self.binary:

~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
   1148             vocabulary = dict(vocabulary)
   1149             if not vocabulary:
-> 1150                 raise ValueError("empty vocabulary; perhaps the documents only"
   1151                                  " contain stop words")
   1152 

ValueError: empty vocabulary; perhaps the documents only contain stop words

Is there something I am missing? Or is there a way to fix this? Thanks.

Answer 1

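The problem seems to be that your records may contain single strings. Try converting them to a list, or setting the minimum document frequency to 1. Please have a look at the linked question below, which produces the result you want:

ValueError: empty vocabulary; perhaps the documents only contain stop words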
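
One likely cause worth checking: iterating over a pandas DataFrame yields its column labels, not its rows, so a vectorizer that is handed the whole frame only sees the column names as "documents". With ngram_range=(2,2), single-word column names contain no bigrams, which would leave the vocabulary empty. Below is a minimal sketch of passing the text column instead. The column names "text" and "label" and the sample sentences are assumptions made up for illustration, since the question does not show what data_df contains.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical data: the column names "text" and "label" are assumptions,
# the question does not show the real layout of data_df.
data_df = pd.DataFrame({
    "text": [
        "the cat sat on the mat",
        "the dog chased the cat",
        "a bird sang in the tree",
        "the dog slept on the mat",
        "a cat climbed the tree",
    ],
    "label": [0, 1, 0, 1, 0],
})

# Iterating over a DataFrame yields its column labels, not its rows,
# so a vectorizer given the whole frame would only see these names.
columns_seen = list(data_df)

train_df, test_df = train_test_split(data_df, test_size=0.2, random_state=0)

vect = TfidfVectorizer(ngram_range=(2, 2))
# Pass the text column (a Series of strings), not the DataFrame itself.
X_train = vect.fit_transform(train_df["text"])
# Reuse the vocabulary learned on the training split for the test split.
X_test = vect.transform(test_df["text"])

print(columns_seen)
print(X_train.shape, X_test.shape)
```

The results are kept in new names (X_train, X_test) rather than overwriting train_df and test_df, so the labels in the original frames stay available for fitting a model afterwards.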
