Fast way to load a lot of indexes (from a Pandas dataframe) into a sparse matrix?

I've got a large Pandas dataframe with 1,500,000 rows, and one column contains lists of numbers. You can imagine it like this:

df = pd.DataFrame({'lists' : [[0, 1, 2], [6, 7, 8], [3, 4, 5]]})

but way bigger. In the end I want a matrix that looks like this:

[1, 1, 1, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 1, 1, 1]
[0, 0, 0, 1, 1, 1, 0, 0, 0]

So the row index of the df is the row index of the matrix, and the numbers in each list are the column indexes that need to be set to True.

The matrix will be of shape 1,500,000 x 30,000, which takes too much RAM as a dense array, so I store it with lil_matrix() and later form the dense matrix batch by batch.
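A minimal sketch of what "batch by batch" could look like, assuming the sparse matrix supports row slicing (lil_matrix and csr_matrix both do). The helper name dense_batches and the tiny 5 x 4 stand-in matrix are made up for illustration and are not part of the original code.

```python
import numpy as np
from scipy import sparse

def dense_batches(matrix, batch_size):
    """Yield consecutive row blocks of a sparse matrix as dense arrays."""
    for start in range(0, matrix.shape[0], batch_size):
        # Row slicing keeps the block sparse; only this block is densified.
        yield matrix[start:start + batch_size].toarray()

# Tiny stand-in for the real 1,500,000 x 30,000 matrix (hypothetical data).
small = sparse.lil_matrix((5, 4), dtype=bool)
small[0, 1] = True
small[3, 2] = True
small[4, 0] = True

batches = list(dense_batches(small, batch_size=2))
```

Only one block of batch_size rows is held in dense form at a time, so the memory cost is batch_size x 30,000 booleans rather than the full matrix.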

This is how I currently fill the matrix:

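from scipy import sparse

# Shape is 1,500,000 rows x 30,000 columns (plain integers, no thousands separators)
sparse_matrix = sparse.lil_matrix((1500000, 30000), dtype=bool)
list_with_lists = df["lists"].tolist()
for i, numbers in enumerate(list_with_lists):
    for number in numbers:
        sparse_matrix[i, number] = True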

It works, but it takes a couple of minutes, which is too long for my use case. Does anyone know a faster way?

Not sure how this will work with scipy.sparse.lil_matrix, but try using advanced indexing:

rows = np.arange(m.shape[0])[:, np.newaxis]
cols = df['lists'].tolist()
m[rows, cols] = 1

Basically, we're saying: set every [row, column] pair found here to True. rows looks like [[0], [1], [2], ..., [N-1]] for an N x M matrix, and cols is your series converted to a list of lists. Broadcasting then pairs each row index with every column index in the corresponding list.

With a test case:

import pandas as pd
import numpy as np

df = pd.DataFrame({'lists' : [[0, 1, 2], [6, 7, 8], [3, 4, 5]]})

m = np.zeros((3, 9), dtype=bool)

rows = np.arange(m.shape[0])[:, np.newaxis]
cols = df['lists'].tolist()
m[rows, cols] = True

print(m.view(np.int8))

I get:

[[1 1 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 1 1]
 [0 0 0 1 1 1 0 0 0]]

You can try dok_matrix with its update function. You will need to prepare a list of the form ((row_idx, col_idx), val) and pass it to update. Here, I use map and itertools.chain to build that list.

from itertools import chain

import pandas as pd
from scipy import sparse

df = pd.DataFrame({'lists' : [[0, 1, 2], [6, 7, 8], [3, 4, 5]]})
sparse_matrix = sparse.dok_matrix((1500000, 30000), dtype=bool)
list_with_lists = df["lists"].tolist()


update_list = chain.from_iterable(map(lambda l, r: [((r, i), 1) for i in l], 
                                      list_with_lists, 
                                      range(len(list_with_lists))))    

# update_list yields the following ((row, col), value) pairs:
# [((0, 0), 1),
#  ((0, 1), 1),
#  ((0, 2), 1),
#  ((1, 6), 1),
#  ((1, 7), 1),
#  ((1, 8), 1),
#  ((2, 3), 1),
#  ((2, 4), 1),
#  ((2, 5), 1)]

sparse_matrix.update(update_list)

Timing

Setup

from numpy.random import randint
df = pd.DataFrame({'lists' : [randint(0, 30000, 10) for i in range(10000)]})
list_with_lists = df["lists"].tolist()

Using lil_matrix with double-loop update

sparse_matrix = sparse.lil_matrix((1500000, 30000), dtype=bool)
%%timeit  # OP's original way of adding using lil_matrix
for i, numbers in enumerate(list_with_lists):
    for number in numbers:
        sparse_matrix[i, number] = True
635 ms ± 26.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using dok_matrix with update

sparse_matrix = sparse.dok_matrix((1500000, 30000), dtype=bool)
%%timeit # updating using `update` function in dok_matrix
update_list = chain.from_iterable(map(lambda l, r: [((r, i), 1) for i in l], 
                              list_with_lists, 
                              range(len(list_with_lists))))    
sparse_matrix.update(update_list)
48.7 ms ± 6.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

dok_matrix, however, may be slow in other matrix operations.
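For comparison, here is a sketch of a different technique that neither answer above uses: flatten the lists into one row-index array and one column-index array, then build the matrix in one call with the coo_matrix((data, (rows, cols))) constructor and convert to CSR. It does not depend on dok_matrix.update, which, as far as I know, newer SciPy releases no longer allow calling directly. The helper name lists_to_csr is made up for illustration, and no timing is claimed for it.

```python
from itertools import chain

import numpy as np
from scipy import sparse

def lists_to_csr(list_with_lists, n_cols):
    """Return a boolean CSR matrix with True at (i, j) for every j in list i."""
    n_rows = len(list_with_lists)
    # Number of entries contributed by each row (lists may differ in length).
    lengths = np.fromiter((len(l) for l in list_with_lists),
                          dtype=np.int64, count=n_rows)
    # Row index repeated once per entry: [0, 0, 0, 1, 1, 1, ...]
    rows = np.repeat(np.arange(n_rows), lengths)
    # All column indices concatenated in row order.
    cols = np.fromiter(chain.from_iterable(list_with_lists), dtype=np.int64)
    data = np.ones(len(rows), dtype=bool)
    coo = sparse.coo_matrix((data, (rows, cols)), shape=(n_rows, n_cols))
    return coo.tocsr()

list_with_lists = [[0, 1, 2], [6, 7, 8], [3, 4, 5]]  # e.g. df["lists"].tolist()
csr = lists_to_csr(list_with_lists, n_cols=9)
```

CSR supports efficient row slicing, so the result can still be turned into dense batches a few rows at a time.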
