
Using a pandas DataFrame to set indices in a NumPy array

I have a pandas DataFrame holding indices into a NumPy array. The array must be set to 1 at those indices. I need to do this millions of times on a large NumPy array. Is there a more efficient way than the approach shown below?

from numpy import float32, uint
from numpy.random import choice
from pandas import DataFrame
from timeit import timeit

xy = 2000,300000
sz = 10000000
ind = DataFrame({"i":choice(range(xy[0]),sz),"j":choice(range(xy[1]),sz)}).drop_duplicates()
dtype = uint
repeats = 10

#original (~21s)
stmt = '''\
from numpy import zeros
a = zeros(xy, dtype=dtype)
a[ind.values[:,0],ind.values[:,1]] = 1'''

print(timeit(stmt, "from __main__ import xy,sz,ind,dtype", number=repeats))

#suggested by @piRSquared (~13s)
stmt = '''\
from numpy import ones
from scipy.sparse import coo_matrix
i,j = ind.i.values,ind.j.values
a = coo_matrix((ones(i.size, dtype=dtype), (i, j)), shape=xy, dtype=dtype).toarray()  # shape=xy keeps the result the same size as zeros(xy)
'''

print(timeit(stmt, "from __main__ import xy,sz,ind,dtype", number=repeats))

I have edited the post above to show the approach suggested by @piRSquared and rewrote it to allow an apples-to-apples comparison. Irrespective of the data type (I tried uint and float32), the suggested approach cuts the runtime by about 40%.

OP time

56.56 s

I can only marginally improve on this with

i, j = ind.i.values, ind.j.values
a[i, j] = 1

New time

52.19 s
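As a quick sanity check, the two indexing forms can be compared on a small array. This is only a sketch: the shape, sample count, seed, and the names `a`, `b`, `same`, and `n_set` are assumptions chosen so that it runs instantly, not the sizes timed above.

```python
import numpy as np
import pandas as pd

# Small, hypothetical sizes (not the 2000 x 300000 array from the question).
shape = (20, 30)
rng = np.random.default_rng(0)
ind = pd.DataFrame({
    "i": rng.integers(0, shape[0], 100),
    "j": rng.integers(0, shape[1], 100),
}).drop_duplicates()

# Form used in the question: slice the 2-D values array twice.
a = np.zeros(shape, dtype=np.uint8)
a[ind.values[:, 0], ind.values[:, 1]] = 1

# Form used in this answer: pull each column's array out once.
i, j = ind.i.values, ind.j.values
b = np.zeros(shape, dtype=np.uint8)
b[i, j] = 1

same = np.array_equal(a, b)   # both forms should set the same cells
n_set = int(b.sum())          # one cell per unique (i, j) pair
print(same, n_set == len(ind))
```

Both forms do the same assignment, which fits with the small difference in the timings.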

However, you can speed this up considerably by using scipy.sparse.coo_matrix to instantiate a sparse matrix and then converting it to a numpy.array.

import timeit

stmt = '''\
import numpy, pandas
from scipy.sparse import coo_matrix

xy = 2000,300000

sz = 10000000
ind = pandas.DataFrame({"i":numpy.random.choice(range(xy[0]),sz),"j":numpy.random.choice(range(xy[1]),sz)}).drop_duplicates()

################################################
i, j = ind.i.values, ind.j.values
dtype = numpy.uint8
a = coo_matrix((numpy.ones(i.size, dtype=dtype), (i, j)), shape=xy, dtype=dtype).toarray()  # shape=xy guarantees the full array size'''

timeit.timeit(stmt, number=10)

33.06471237000369
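Two details of the coo_matrix route are worth checking on a toy example. When `shape` is not passed, the shape is inferred from the largest indices, and converting to a dense array sums repeated (i, j) pairs instead of overwriting them. The sketch below uses made-up indices and a made-up 4 x 5 shape to show both effects; `np.unique` stands in here for the `drop_duplicates()` call used above.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy data: the pair (1, 2) appears twice.
shape = (4, 5)
i = np.array([0, 1, 1, 2])
j = np.array([0, 2, 2, 3])

# Reference result from plain fancy indexing.
dense = np.zeros(shape, dtype=np.uint8)
dense[i, j] = 1

# With duplicates left in, the repeated pair is summed to 2.
summed = coo_matrix(
    (np.ones(i.size, dtype=np.uint8), (i, j)), shape=shape
).toarray()

# Dropping duplicate pairs first restores the 0/1 result.
pairs = np.unique(np.stack([i, j], axis=1), axis=0)
ui, uj = pairs[:, 0], pairs[:, 1]
deduped = coo_matrix(
    (np.ones(ui.size, dtype=np.uint8), (ui, uj)), shape=shape
).toarray()

print(summed[1, 2], np.array_equal(deduped, dense))
```

This is why the DataFrame is de-duplicated before the sparse matrix is built, and why passing `shape` explicitly is the safer choice whenever the largest row or column index might not occur in the data.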
