Scikit-learn Naive Bayes ValueError: dimension mismatch

Question

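I am working with the Naive Bayes classifier in scikit-learn.

In both the training and the prediction phase, I use the following code to build a csr_matrix from a list of lists of tuples: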

from itertools import chain
import logging

from scipy.sparse import csr_matrix

logger = logging.getLogger(__name__)


def convert_to_csr_matrix(vectors):
    """
    convert list of tuples representation to scipy csr_matrix that is needed
    for scikit learner
    """
    logger.info("building the csr_sparse matrix representing tf-idf")
    # one row index per (column, value) tuple in each vector
    row = [[i] * len(v) for i, v in enumerate(vectors)]
    row = list(chain(*row))
    column = [j for j, _ in chain(*vectors)]
    data = [d for _, d in chain(*vectors)]
    return csr_matrix((data, (row, column)))

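I based the implementation mostly on the question "scipy csr_matrix from several vectors represented as list of sets".

Unfortunately, I now get the following error in the prediction phase: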

File "/Users/zikes/project/taxonomy_data_preprocessing/single_classification.py", line 93, in predict
top_predictions = self.top.predict(item)
File "/Users/zikes/project/taxonomy_data_preprocessing/single_classification.py", line 124, in predict
category, res = model.predict(item)
File "/Users/zikes/project/taxonomy_data_preprocessing/single_classification.py", line 176, in predict
prediction = self.clf.predict(item)
File "/Users/zikes/.virtualenvs/taxonomy/lib/python2.7/site-packages/sklearn/naive_bayes.py", line 64, in predict
jll = self._joint_log_likelihood(X)
File "/Users/zikes/.virtualenvs/taxonomy/lib/python2.7/site-packages/sklearn/naive_bayes.py", line 615, in _joint_log_likelihood
return (safe_sparse_dot(X, self.feature_log_prob_.T)
File "/Users/zikes/.virtualenvs/taxonomy/lib/python2.7/site-packages/sklearn/utils/extmath.py", line 178, in safe_sparse_dot
ret = a * b
File "/Users/zikes/.virtualenvs/taxonomy/lib/python2.7/site-packages/scipy/sparse/base.py", line 354, in __mul__
raise ValueError('dimension mismatch')
ValueError: dimension mismatch

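Does anyone know what might be going wrong? I guess the sparse vector has the wrong size, but I don't understand why.

While debugging, I printed the feature_log_prob_ attribute of the Naive Bayes model that appears in the traceback. It looks like this: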

[[-11.82052115 -12.51735721 -12.51735721 ..., -12.51735721 -11.60489688
-12.2132116 ]
[-12.21403023 -12.51130295 -12.51130295 ..., -11.84156341 -12.51130295
-12.51130295]]

and its shape is (2, 53961).

The csr_matrix I want to predict on is: (0, 7637) 0.770238101052 (0, 21849) 0.637756432886

Represented as a list of tuples, it looks like this: [(7637, 0.7702381010520318), (21849, 0.6377564328862234)]
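
The mismatch can be reproduced without the classifier. The following is a minimal sketch, assuming only the item and the feature count quoted above; the variable names are illustrative and not part of the original code.

```python
from scipy.sparse import csr_matrix

# The single item from the question, as (column_index, tf-idf weight) tuples.
item = [(7637, 0.7702381010520318), (21849, 0.6377564328862234)]

# Second dimension of feature_log_prob_ reported in the question.
n_features = 53961

rows = [0] * len(item)
cols = [j for j, _ in item]
data = [d for _, d in item]

# No shape argument: scipy infers the shape from the largest row and
# column index, so the matrix gets 21849 + 1 = 21850 columns.
X = csr_matrix((data, (rows, cols)))

inferred_shape = X.shape                         # (1, 21850)
matches_model = (inferred_shape[1] == n_features)  # False
print(inferred_shape, matches_model)
```

Multiplying a matrix with 21850 columns by the transposed (2, 53961) array is what raises the dimension mismatch.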

Answer 1

After investigating the problem a bit, I found a possible fix. In the function:

def convert_to_csr_matrix(vectors):
   """
   convert list of tuples representation to scipy csr_matrix that is needed
   for scikit learner
   """
   logger.info("building the csr_sparse matrix representing tf-idf")
   row = [[i] * len(v) for i, v in enumerate(vectors)]
   row = list(chain(*row))
   column = [j for j, _ in chain(*vectors)]
   data = [d for _, d in chain(*vectors)]
   return csr_matrix((data, (row, column)))

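the line return csr_matrix((data, (row, column))) should be replaced with return csr_matrix((data, (row, column)), shape=(len(vectors), dimension)), where dimension is the number of features the model was trained on (53961 here). Without an explicit shape, csr_matrix infers the number of columns from the largest column index present, so a matrix built from a single item at prediction time ends up narrower than the training matrix.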
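
A minimal sketch of the function with an explicit shape, for illustration. The extra dimension parameter is an assumption (the original function takes only vectors), and the logging call is left out to keep the example self-contained.

```python
from itertools import chain

from scipy.sparse import csr_matrix


def convert_to_csr_matrix(vectors, dimension):
    """
    Convert a list of [(column, value), ...] vectors to a csr_matrix
    with a fixed number of columns.

    `dimension` is a hypothetical extra parameter: the number of features
    the model was trained on.
    """
    # One row index per (column, value) tuple in each vector.
    row = list(chain.from_iterable([i] * len(v) for i, v in enumerate(vectors)))
    column = [j for j, _ in chain.from_iterable(vectors)]
    data = [d for _, d in chain.from_iterable(vectors)]
    # The explicit shape keeps the matrix width independent of which
    # column indices happen to occur in `vectors`.
    return csr_matrix((data, (row, column)), shape=(len(vectors), dimension))


# The item from the question, wrapped in a list of one vector.
item = [[(7637, 0.7702381010520318), (21849, 0.6377564328862234)]]
X = convert_to_csr_matrix(item, dimension=53961)
print(X.shape)
```

At prediction time the value could probably be read from the fitted model, for example feature_log_prob_.shape[1], which is 53961 in the question. Passing len(vectors) as the row count also keeps trailing empty vectors from being dropped.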

Answer 1
0 accepted 2015-08-05 15:12:56
