Why doesn't the pandas DataFrame version of a sparse matrix work with imblearn's RandomOverSampler when the documentation says it accepts both?

Question

I spent a painful night debugging this:

import pandas as pd
from imblearn.over_sampling import RandomOverSampler


x_trainvec_rand, y_train_rand = RandomOverSampler(random_state=0).fit_resample(pd.DataFrame.sparse.from_spmatrix(x_trainvec), y_train)   

print(x_trainvec_rand)

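Here x_trainvec is a csr sparse matrix and y_train is a pandas DataFrame; as DataFrames their shapes are (75060 x 52651) and (75060 x 1) respectively. The error was "ValueError: Shape of passed values is (290210, 1), indices imply (290210, 52651)".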

Then, on a whim, I decided to try

import pandas as pd
from imblearn.over_sampling import RandomOverSampler


x_trainvec_rand, y_train_rand = RandomOverSampler(random_state=0).fit_resample(x_trainvec, y_train)   

print(x_trainvec_rand)

and somehow it worked.

Any ideas as to why?

The documentation says:

fit_resample(X, y)
Resample the dataset.

Parameters
X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)
Matrix containing the data which have to be sampled.

y : array-like of shape (n_samples,)
Corresponding label for each sample in X.

Answer 1

The documentation says it accepts

X : {array-like, dataframe, sparse matrix}

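That is a sparse matrix, not a sparse dataframe. In the imbalanced-learn source code I found tests that the sparse type must be csr or csc, but I could not follow the processing any further.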
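As a rough illustration of that distinction (a sketch, not taken from the imbalanced-learn source), `scipy.sparse.issparse` recognises the csr matrix but not the DataFrame built from it. The variable names `is_matrix_sparse` and `is_frame_sparse` are made up for this example, and whether imbalanced-learn uses this exact check internally is an assumption.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# A small csr matrix and the sparse DataFrame derived from it
M = sparse.csr_matrix(np.eye(3))
df = pd.DataFrame.sparse.from_spmatrix(M)

# issparse() only recognises scipy sparse containers
is_matrix_sparse = sparse.issparse(M)
is_frame_sparse = sparse.issparse(df)

print(type(M).__name__, is_matrix_sparse)
print(type(df).__name__, is_frame_sparse)
```

So to code that tests for a scipy sparse type, the sparse DataFrame looks like an ordinary DataFrame.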

But let's look at pandas sparse.

A sparse matrix:

In [105]: M = sparse.csr_matrix(np.eye(3))
In [106]: M
Out[106]: 
<3x3 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>
In [107]: print(M)
  (0, 0)    1.0
  (1, 1)    1.0
  (2, 2)    1.0

The dataframe derived from it:

In [108]: df = pd.DataFrame.sparse.from_spmatrix(M)
In [109]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype             
---  ------  --------------  -----             
 0   0       3 non-null      Sparse[float64, 0]
 1   1       3 non-null      Sparse[float64, 0]
 2   2       3 non-null      Sparse[float64, 0]
dtypes: Sparse[float64, 0](3)
memory usage: 164.0 bytes
In [110]: df[1]
Out[110]: 
0    0.0
1    1.0
2    0.0
Name: 1, dtype: Sparse[float64, 0]
In [111]: df[1].values
Out[111]: 
[0, 1.0, 0]
Fill: 0
IntIndex
Indices: array([1], dtype=int32)

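Sparse dataframe storage is completely different from that of a sparse matrix: each column holds its own sparse array. It is not a simple merger of the two classes.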
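If the data already sits in a sparse DataFrame, one possible way around this is to convert it back to a scipy csr matrix before resampling. The sketch below uses the pandas `DataFrame.sparse.to_coo()` accessor on the same 3x3 example; the names `col` and `M_back` are illustrative only, and the resampling call itself is left out.

```python
import numpy as np
import pandas as pd
from scipy import sparse

M = sparse.csr_matrix(np.eye(3))
df = pd.DataFrame.sparse.from_spmatrix(M)

# Each column is stored as a separate pandas SparseArray
col = df[1].array
print(type(col).__name__)
print(col.sp_values, col.fill_value)

# Convert the sparse DataFrame back to a scipy csr matrix
M_back = df.sparse.to_coo().tocsr()
print(M_back.format, M_back.shape)

# The round trip should preserve the values
print(np.array_equal(M_back.toarray(), M.toarray()))
```

A matrix like `M_back` is the kind of input that worked in the question's second snippet, where the csr matrix was passed to fit_resample directly.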

I probably should insist on seeing the full traceback of the error,

 ValueError: Shape of passed values is (290210, 1), indices imply (290210, 52651)

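since it might at least give us/you an idea of what it is trying to do. On the other hand, paying attention to what the documentation actually says, rather than what you want it to say, is enough.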
