How do I compute the variance of the columns of a sparse matrix in SciPy?

Question

I have a large scipy.sparse.csc_matrix and would like to normalize it. That is, subtract the column mean from each element and divide by the column standard deviation (std).

scipy.sparse.csc_matrix has a .mean() method, but is there an efficient way to compute the variance or std?

Answer 1

You can calculate the variance yourself from the mean, using the following formula:

E[X^2] - (E[X])^2

E[X] stands for the mean. To calculate E[X^2], square the csc_matrix element-wise and then apply the mean function to the result. To get (E[X])^2, simply square the result of the mean function applied to the original matrix.
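The formula above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper name `sparse_column_variance` and the random test matrix are assumptions made for the example. It uses `.multiply()` for the element-wise square, because for the scipy.sparse matrix classes (such as csc_matrix) the `**` operator means matrix power rather than element-wise power.

```python
import numpy as np
from scipy import sparse


def sparse_column_variance(mat):
    """Population variance (ddof=0) of each column of a SciPy sparse matrix."""
    # E[X^2]: element-wise square, then column means
    mean_of_squares = np.asarray(mat.multiply(mat).mean(axis=0)).ravel()
    # (E[X])^2: column means, then square
    squared_mean = np.asarray(mat.mean(axis=0)).ravel() ** 2
    return mean_of_squares - squared_mean


# Hypothetical example data: 1000 rows, 5 columns, about 10% non-zero
mat = sparse.random(1000, 5, density=0.1, format="csc", random_state=0)

col_var = sparse_column_variance(mat)
# Rounding can push a tiny variance slightly below zero, so clip before the square root
col_std = np.sqrt(np.maximum(col_var, 0.0))
```

Note that subtracting two nearly equal numbers can lose precision when the mean is large relative to the spread, so results may differ slightly from a two-pass computation.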

Answer 2

Sicco has the better answer.

However, another way is to convert the sparse matrix to a dense numpy array one column at a time (to keep the memory requirements lower than converting the whole matrix at once):

import numpy as np

# mat is the sparse matrix
# Get the number of columns
cols = mat.shape[1]
arr = np.empty(shape=cols)
for i in range(cols):
    # Densify one column at a time and take its variance
    arr[i] = np.var(mat[:, i].toarray())
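A variation on the per-column loop avoids densifying at all by reading the stored values of each column directly from the CSC arrays. The sketch below is only an illustration: the function name `csc_column_variance` is hypothetical, and it assumes the matrix holds no duplicate entries. The implicit zeros of a column each contribute `mean ** 2` to the sum of squared deviations, so they are accounted for by a count rather than by materializing them.

```python
import numpy as np
from scipy import sparse


def csc_column_variance(mat):
    """Population variance (ddof=0) of each column, using only stored entries."""
    mat = mat.tocsc()  # no copy if the matrix is already CSC
    n_rows, n_cols = mat.shape
    out = np.empty(n_cols)
    for j in range(n_cols):
        # Explicitly stored values of column j
        vals = mat.data[mat.indptr[j]:mat.indptr[j + 1]]
        mean = vals.sum() / n_rows
        # Squared deviations of the stored values plus those of the implicit zeros
        n_implicit_zeros = n_rows - vals.size
        out[j] = (((vals - mean) ** 2).sum() + n_implicit_zeros * mean ** 2) / n_rows
    return out


# Hypothetical example data
example = sparse.random(200, 7, density=0.2, format="csc", random_state=1)
example_var = csc_column_variance(example)
```

Because it sums squared deviations from the mean instead of subtracting two large quantities, this form is usually less sensitive to rounding than E[X^2] - (E[X])^2.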

Answer 3

The most efficient way I know of is to use StandardScaler from scikit-learn:

from sklearn.preprocessing import StandardScaler


scalar = StandardScaler(with_mean=False)
scalar.fit(X)

The variances are then in the attribute var_:

X_var = scalar.var_

The curious thing, though, is that when I densified first using pandas (which is very slow), my answer was off by a few percent. I don't know which is more accurate.
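One possible source of a discrepancy like this, though not necessarily the one encountered here, is the degrees-of-freedom convention: StandardScaler documents a biased (population) estimator equivalent to numpy's ddof=0, while pandas' .var() and .std() default to ddof=1. The sketch below uses only numpy and a hypothetical 50-row column to show the size of that effect.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)  # hypothetical column with 50 rows

var_population = np.var(x, ddof=0)  # convention of np.var and StandardScaler.var_
var_sample = np.var(x, ddof=1)      # default convention of pandas .var()

# The two differ by a factor of exactly n / (n - 1)
ratio = var_sample / var_population
```

The factor n / (n - 1) only amounts to a few percent when the number of rows is in the tens, so for a large matrix some other cause would have to be involved.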

Answer 4

The efficient way is actually to densify the entire matrix, then standardize it in the usual way with:

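# Cast to float so the in-place operations also work for integer input
X = X.toarray().astype(float)
# axis=0 gives per-column statistics, as the question asks
X -= X.mean(axis=0)
X /= X.std(axis=0)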

As @Sebastian has noted in his comments, standardizing destroys the sparsity structure (it introduces lots of non-zero elements) in the subtraction step, so there's no use keeping the matrix in a sparse format.
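The effect on sparsity can be checked with a small experiment. This is a sketch on hypothetical random data, with variable names chosen for illustration: dividing each column by its std keeps zeros at zero, whereas subtracting the column mean turns every former zero into a non-zero value.

```python
import numpy as np
from scipy import sparse

# Hypothetical example data: 1000 rows, 4 columns, about 5% non-zero
X = sparse.random(1000, 4, density=0.05, format="csc", random_state=0)

# Column statistics computed without densifying
mean = np.asarray(X.mean(axis=0)).ravel()
var = np.asarray(X.multiply(X).mean(axis=0)).ravel() - mean ** 2
std = np.sqrt(var)

# Scaling only: right-multiplying by a diagonal matrix divides each column by its std
X_scaled = X @ sparse.diags(1.0 / std)

# Centering as well: each former zero becomes -mean / std, so the result is dense
X_standardized = (X.toarray() - mean) / std

nnz_before = X.nnz
nnz_scaled = X_scaled.nnz
nnz_standardized = np.count_nonzero(X_standardized)
```

If only the scaling step is needed, the matrix can stay sparse; full standardization needs a dense array of the same shape.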

4 answers

Answer 1: score 14, accepted, 2012-08-29 09:31:42
Answer 2: score 0, 2020-01-27 11:28:24
Answer 3: score 0, 2021-02-09 07:52:25
Answer 4: score -3, 2012-08-29 12:16:33
