L2 normalization of rows in a sparse matrix

Question

I want to use only numpy and scipy (not scikit-learn), and I was wondering how to perform an L2 normalization of the rows of a huge scipy csc_matrix (2,000,000 x 500,000). The operation must consume as little memory as possible, since everything has to fit in memory.

What I have so far is:

import numpy as np
import scipy.sparse as sp

tf_idf_matrix = sp.lil_matrix((n_docs, n_terms), dtype=np.float16)
# ... perform several operations and fill up the matrix

tf_idf_matrix = tf_idf_matrix / l2_norm(tf_idf_matrix)
# l2_norm() is what I want

def l2_norm(sparse_matrix):
    pass
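For reference, the target operation can be sketched on a tiny dense array. The function name `l2_normalize_rows_dense` and the choice to leave all-zero rows unchanged are assumptions made for illustration. This dense version is only useful for checking results on small inputs, not for the full matrix.

```python
import numpy as np

def l2_normalize_rows_dense(x):
    """Divide each row of a dense 2-D array by its Euclidean (L2) norm."""
    x = np.asarray(x, dtype=np.float64)
    norms = np.sqrt((x ** 2).sum(axis=1))
    # Assumption: all-zero rows are left untouched instead of dividing by zero.
    norms[norms == 0] = 1.0
    return x / norms[:, np.newaxis]

small = np.array([[3.0, 4.0, 0.0],
                  [0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0]])
normalized = l2_normalize_rows_dense(small)
```

After this runs, every non-zero row of `normalized` has unit Euclidean length.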

Answer 1

Since I couldn't find the answer anywhere, I will post here how I approached the problem.

import numpy as np

def l2_norm(sparse_csc_matrix):
    # first, I convert a copy of the csc_matrix to csr_matrix, which is done in linear time
    norm = sparse_csc_matrix.tocsr(copy=True)

    # compute the inverse of the l2 norm of each row (all-zero rows keep a factor of 0)
    norm.data **= 2
    norm = np.asarray(norm.sum(axis=1)).ravel()
    nonzero = norm > 0
    norm[nonzero] = 1.0 / np.sqrt(norm[nonzero])

    # modify sparse_csc_matrix in place: in CSC format, indices[k] is the row
    # of data[k], so each stored value is scaled by the factor of its own row
    sparse_csc_matrix.data *= norm[sparse_csc_matrix.indices]

If anyone has a better approach, please post it.
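One possible variation, offered as a sketch rather than a tested solution at the scale in the question: the per-row sums of squares can be accumulated directly from the CSC arrays with `np.bincount`, which avoids making a CSR copy of the whole matrix. The function name `l2_normalize_rows_csc_inplace` is hypothetical, and the sketch assumes the matrix has a floating-point dtype. It still allocates temporary arrays with one entry per stored value.

```python
import numpy as np
import scipy.sparse as sp

def l2_normalize_rows_csc_inplace(m):
    """Scale the rows of a CSC matrix to unit L2 norm, modifying m.data in place."""
    if m.format != "csc":
        raise ValueError("expected a matrix in CSC format")
    n_rows = m.shape[0]
    # In CSC storage, m.indices[k] is the row index of the stored value m.data[k],
    # so a weighted bincount accumulates the sum of squares of every row.
    sq_sums = np.bincount(m.indices,
                          weights=np.square(m.data, dtype=np.float64),
                          minlength=n_rows)
    # Rows without non-zero values keep a scale factor of 1 (assumption: leave them as-is).
    scale = np.ones(n_rows, dtype=np.float64)
    nonzero = sq_sums > 0
    scale[nonzero] = 1.0 / np.sqrt(sq_sums[nonzero])
    # Scale each stored value by the factor belonging to its row.
    m.data *= scale[m.indices]
    return m

dense = np.array([[3.0, 0.0, 4.0],
                  [0.0, 0.0, 0.0],
                  [0.0, 2.0, 0.0]])
m = sp.csc_matrix(dense)
l2_normalize_rows_csc_inplace(m)
result = m.toarray()
```

On this small example the first row becomes (0.6, 0, 0.8), the empty row stays empty, and the last row becomes (0, 1, 0). Whether this uses less peak memory than the CSR-copy approach on a real matrix would need to be measured.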

Answer 1 | 2 | accepted | 2014-03-01 23:55:18
