Creating a sparse matrix from a large dataframe in Python

Question

I'm trying to use a sparse matrix in my regression because there are over 40,000 variables after I add dummy variables. To do this, I believe I need to feed the model a sparse matrix. However, I can't convert my pandas dataframe into a matrix using the code found here:

Convert Pandas dataframe to Sparse Numpy Matrix directly

This is because the dataset is too large, and I run into a memory error. The following example reproduces the issue:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,40000,size=(1000000, 4)), columns=list('ABCD'))
df = pd.get_dummies(df,columns=['D'],sparse=True,drop_first=True)
df = df.values

Ultimately, I'd like to convert the dataframe (3 million records with 49,000 columns) into a matrix, because I suspect I can create a sparse matrix and use that for my regression. This works quite well on a smaller subset, but I need to test the entire dataset. The example above raises a "MemoryError" right away, so I suspect it's some Python limitation, but I am hoping there is a workaround.
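For context, a rough back-of-the-envelope estimate suggests the error comes from the size of the dense array that `df.values` has to allocate, rather than from a limit in Python itself. The sketch below assumes the example's 1,000,000 rows, up to 39,999 dummy columns for 'D' plus the columns A, B and C, and either 1 or 8 bytes per cell; the helper name `dense_gib` and the bytes-per-cell figures are illustrative assumptions, not measurements.

```python
def dense_gib(rows, cols, bytes_per_cell):
    """Approximate size in GiB of a dense rows x cols array."""
    return rows * cols * bytes_per_cell / 2**30


rows = 1_000_000
dummy_cols = 40_000 - 1   # upper bound on one-hot columns for 'D' with drop_first=True
cols = 3 + dummy_cols     # plus the untouched columns A, B, C

# Assumed cell sizes: 1 byte (uint8/bool) and 8 bytes (int64/float64).
for bytes_per_cell in (1, 8):
    size = dense_gib(rows, cols, bytes_per_cell)
    print(f"{bytes_per_cell} byte(s) per cell: about {size:,.0f} GiB")
```

Under these assumptions the dense array would need tens to hundreds of GiB, which is more than most single machines can provide.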

Answer 1

Building a sparse matrix can be a costly operation. With SciPy, it can be very difficult to create a large sparse matrix, and your system memory might not support it.

I suggest using the Spark libraries, so that your dataset can be distributed across a cluster (as an RDD). Below is some sample code:

from pyspark.mllib.linalg import Vectors

# A sparse vector of length 3 with non-zero entries at indices 0 and 2
sparse = Vectors.sparse(3, [0, 2], [1.0, 3.0])

I hope this helps. Please let me know if you still have any questions; I would be very happy to help.

Answer 2

You can do that like this:

import numpy as np
import pandas as pd
import scipy.sparse

N = 40000
M = 1000000
df = pd.DataFrame(np.random.randint(0, N, size=(M, 4)), columns=list('ABCD'))
v = df['D'].values
sp = scipy.sparse.coo_matrix((np.ones_like(v), (np.arange(len(v)), v)), shape=[len(v), N])
print(sp.shape)
# (1000000, 40000)
print(sp.getnnz())
# 1000000
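The snippet above encodes only column 'D'. A minimal sketch of how the remaining columns might be combined with it into one sparse design matrix follows. It uses small, assumed sizes so it runs quickly, and drops the first one-hot column to mirror `drop_first=True` from the question. Unlike `get_dummies`, this always produces a fixed number of columns (N - 1) even if some category never occurs in the data.

```python
import numpy as np
import pandas as pd
import scipy.sparse

# Small illustrative sizes (assumptions); the question uses far larger values.
N = 50     # number of distinct categories in column 'D'
M = 1000   # number of rows
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, N, size=(M, 4)), columns=list('ABCD'))

# One-hot encode 'D' directly as a sparse matrix, then convert to CSR
# so that it supports column slicing.
codes = df['D'].to_numpy()
onehot = scipy.sparse.coo_matrix(
    (np.ones(M, dtype=np.float64), (np.arange(M), codes)), shape=(M, N)
).tocsr()

# Drop the column for category 0, similar in spirit to drop_first=True.
onehot = onehot[:, 1:]

# Wrap the other columns in a narrow sparse block and stack side by side.
numeric = scipy.sparse.csr_matrix(df[['A', 'B', 'C']].to_numpy(dtype=np.float64))
X = scipy.sparse.hstack([numeric, onehot], format='csr')

print(X.shape)
# (1000, 52)
```

CSR is chosen here because many estimators that accept sparse input work row-wise; whether a particular regression routine accepts `X` directly depends on the library in use.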


Answer 1
0 2019-04-05 14:56:19

Answer 2
0 2019-04-05 14:58:06
