
Pandas sparse DataFrame to sparse matrix, without generating a dense matrix in memory

Is there a way to convert from a pandas.SparseDataFrame to scipy.sparse.csr_matrix, without generating a dense matrix in memory?

scipy.sparse.csr_matrix(df.values)

doesn't work, as it generates a dense matrix which is then cast to a csr_matrix.

Thanks in advance!

Pandas 0.20.0+:

As of pandas version 0.20.0, released May 5, 2017, there is a one-liner for this:

from scipy import sparse


def sparse_df_to_csr(df):
    return sparse.csr_matrix(df.to_coo())

This uses the new to_coo() method.

Earlier Versions:

Building on Victor May's answer, here's a slightly faster implementation, but it only works if every column of the SparseDataFrame is sparse and uses a BlockIndex (note: if it was created with get_dummies, this will be the case).

Edit: I modified this so it will work with a non-zero fill value. CSR has no native non-zero fill value, so you will have to record it externally.

import numpy as np
import pandas as pd
from scipy import sparse

def sparse_BlockIndex_df_to_csr(df):
    columns = df.columns
    zipped_data = zip(*[(df[col].sp_values - df[col].fill_value,
                         df[col].sp_index.to_int_index().indices)
                        for col in columns])
    data, rows = map(list, zipped_data)
    # One integer column index per stored value (avoids float-typed indices).
    cols = [np.full(len(a), i, dtype=np.int64) for (i, a) in enumerate(data)]
    data_f = np.concatenate(data)
    rows_f = np.concatenate(rows)
    cols_f = np.concatenate(cols)
    arr = sparse.coo_matrix((data_f, (rows_f, cols_f)),
                            df.shape, dtype=np.float64)
    return arr.tocsr()
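The fill-value bookkeeping mentioned in the edit can be sketched without pandas. This is only an illustration under assumptions: the fill value, positions, and stored values below are made up, and the matrix stores `value - fill_value` just as the function above does, with the fill value kept in a separate variable.

```python
import numpy as np
from scipy import sparse

fill_value = 1.0  # hypothetical fill value, recorded outside the matrix

# Hypothetical stored entries: positions and their actual values.
rows = np.array([0, 2])
cols = np.array([1, 0])
values = np.array([4.0, -3.0])

# Store offsets from the fill value, so implicit zeros mean "equal to fill_value".
offsets = sparse.coo_matrix((values - fill_value, (rows, cols)), shape=(3, 2)).tocsr()

# Recovering the original values (dense here only to check the toy example).
restored = offsets.toarray() + fill_value
```

Any downstream code that consumes the CSR matrix has to add the recorded fill value back itself, since SciPy treats the missing entries as zeros.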

The answer by @Marigold does the trick, but it is slow because it accesses all elements in each column, including the zeros. Building on it, I wrote the following quick-and-dirty code, which runs about 50x faster on a 1000x1000 matrix with a density of about 1%. My code also handles dense columns appropriately.

import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
# BlockIndex lives here in pandas 0.20+; the module path differs in older releases.
from pandas._libs.sparse import BlockIndex

def sparse_df_to_array(df):
    num_rows = df.shape[0]

    data = []
    row = []
    col = []

    for i, col_name in enumerate(df.columns):
        if isinstance(df[col_name], pd.SparseSeries):
            column_index = df[col_name].sp_index
            if isinstance(column_index, BlockIndex):
                column_index = column_index.to_int_index()

            ix = column_index.indices
            data.append(df[col_name].sp_values)
            row.append(ix)
            col.append(len(df[col_name].sp_values) * [i])
        else:
            data.append(df[col_name].values)
            row.append(np.array(range(0, num_rows)))
            col.append(np.array(num_rows * [i]))

    data_f = np.concatenate(data)
    row_f = np.concatenate(row)
    col_f = np.concatenate(col)

    arr = coo_matrix((data_f, (row_f, col_f)), df.shape, dtype=np.float64)
    return arr.tocsr()
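The core idea of the function above, independent of the old SparseSeries API, is to collect only the stored entries of each column and hand them to SciPy as (data, (row, col)) triplets. A minimal sketch, assuming a hypothetical per-column layout of (row indices, values) pairs rather than real pandas objects:

```python
import numpy as np
from scipy import sparse

# Hypothetical per-column storage: (row indices of stored entries, stored values).
columns = [
    (np.array([0, 3], dtype=np.int64), np.array([1.0, 2.0])),  # column 0
    (np.array([], dtype=np.int64), np.array([])),              # column 1: all zeros
    (np.array([2], dtype=np.int64), np.array([7.0])),          # column 2
]
n_rows = 4

# Concatenate row indices, matching column indices, and values across columns.
rows = np.concatenate([idx for idx, _ in columns])
cols = np.concatenate(
    [np.full(len(idx), j, dtype=np.int64) for j, (idx, _) in enumerate(columns)]
)
data = np.concatenate([vals for _, vals in columns])

# Build once in COO form, then convert to CSR; no dense array is created.
csr = sparse.coo_matrix((data, (rows, cols)), shape=(n_rows, len(columns))).tocsr()
```

The work done is proportional to the number of stored entries, not to the full size of the table, which is where the speed-up over element-by-element access comes from.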

As of pandas version 0.25, SparseSeries and SparseDataFrame are deprecated. DataFrames now support sparse dtypes for columns with sparse data. Sparse methods are available through the sparse accessor, so the conversion one-liner now looks like this:

sparse_matrix = scipy.sparse.csr_matrix(df.sparse.to_coo())
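For context, here is a small end-to-end sketch of that accessor-based conversion. The toy data and column names are assumptions made for illustration, and the small dense array exists only to build the example frame; the conversion step itself goes through `df.sparse.to_coo()`.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical toy data: mostly zeros.
dense = np.zeros((4, 3))
dense[0, 1] = 2.0
dense[3, 2] = 5.0

# Every column gets a sparse dtype with fill value 0.
df = pd.DataFrame(dense, columns=["a", "b", "c"]).astype(pd.SparseDtype("float64", 0.0))

# The accessor returns a scipy COO matrix built from the stored values only.
coo = df.sparse.to_coo()
csr = sparse.csr_matrix(coo)
```

The accessor is only available when all columns of the frame have a sparse dtype.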

The pandas docs describe an experimental conversion to SciPy sparse, SparseSeries.to_coo:

http://pandas-docs.github.io/pandas-docs-travis/sparse.html#interaction-with-scipy-sparse

================

Edit - this is a special function that works from a MultiIndex, not from a data frame. See the other answers for that. Note the difference in dates.

============

As of 0.20.0, there is an sdf.to_coo() and a MultiIndex-based ss.to_coo(). Since a sparse matrix is inherently 2d, it makes sense to require a MultiIndex for the (effectively) 1d data series, whereas a dataframe can already represent a table or 2d array.

When I first responded to this question (June 2015), this sparse dataframe/series feature was experimental.

Here's a solution that fills the sparse matrix column by column (it assumes you can fit at least one column in memory).

import pandas as pd
import numpy as np
from scipy.sparse import lil_matrix

def sparse_df_to_array(df):
    """ Convert sparse dataframe to sparse array csr_matrix used by
    scikit learn. """
    arr = lil_matrix(df.shape, dtype=np.float32)
    for i, col in enumerate(df.columns):
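        # Boolean mask of the non-zero entries in this column.
        ix = df[col] != 0
        # .loc replaces the removed .ix indexer; np.where(...)[0] gives row positions.
        arr[np.where(ix)[0], i] = df.loc[ix, col]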

    return arr.tocsr()

EDIT: This method actually creates a dense representation at some stage, so it doesn't answer the question.

You should be able to use the experimental .to_coo() method in pandas [1] in the following way:

df, idx_rows, idx_cols = df.stack().to_sparse().to_coo()
df = df.tocsr()

Instead of taking a DataFrame (rows / columns), this method takes a Series with the rows and columns in a MultiIndex (this is why you need the .stack() method). This Series with the MultiIndex needs to be a SparseSeries, and even if your input is a SparseDataFrame, .stack() returns a regular Series. So, you need to use the .to_sparse() method before calling .to_coo().

The Series returned by .stack(), even though it is not a SparseSeries, only contains the elements that are not null, so it shouldn't take more memory than the sparse version (at least with np.nan as the fill value when the type is np.float).

  1. http://pandas.pydata.org/pandas-docs/stable/sparse.html#interaction-with-scipy-sparse
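In pandas 1.0 and later, SparseSeries and .to_sparse() no longer exist, but the same MultiIndex-based conversion is reachable through the Series.sparse accessor. The following is a sketch under assumptions: the labels and values are made up, and the Series is built directly with a two-level index instead of via .stack().

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: (row label, column label) -> value.
index = pd.MultiIndex.from_tuples(
    [("r0", "c1"), ("r2", "c0")], names=["row", "col"]
)
s = pd.Series([2.0, 5.0], index=index).astype(pd.SparseDtype("float64", np.nan))

# Level 0 supplies the matrix rows, level 1 the matrix columns.
coo, row_labels, col_labels = s.sparse.to_coo(row_levels=[0], column_levels=[1])
csr = coo.tocsr()
```

The returned label lists map matrix positions back to the original index values, so they are worth keeping alongside the matrix.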
