
More efficient way to access rows based on a list of indices in 2d numpy array?

So I have a 2D numpy array arr. It's a relatively big one: arr.shape = (2400, 60000)

What I'm currently doing is the following:

  • randomly (with replacement) select arr.shape[0] indices
  • access (row-wise) the chosen indices of arr
  • calculate column-wise averages and select the max value
  • repeat this k times

It looks something like:

import numpy as np

no_samples = 1000  # number of bootstrap repetitions
no_rows = arr.shape[0]
indices = np.arange(no_rows)
my_vals = []
for k in range(no_samples):
    random_idxs = np.random.choice(indices, size=no_rows, replace=True)
    my_vals.append(
        arr[random_idxs].mean(axis=0).max()
    )

My problem is that it is very slow. With my arr size, it takes ~3s for 1 loop. As I want a sample that is bigger than 1k, my current solution is pretty bad (1k * ~3s -> ~1h). I've profiled it and the bottleneck is accessing the rows based on indices. mean and max work fast, and np.random.choice is also OK.

Do you see any area for improvement? A more efficient way of accessing the indices, or better yet, a faster approach that solves the problem without it?

What I tried so far:

  • numpy.take (slower)
  • numpy.ravel:

something similar to:

random_idxs = np.random.choice(sample_idxs, size=sample_size, replace=True) 
test = random_idxs.ravel()[arr.ravel()].reshape(arr.shape)
  • a similar approach to the current one, but without the loop: I created a 3D array and accessed the rows across the additional dimension in one go
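For reference, the loop-free variant from the last bullet can be sketched as below. The toy shapes, the seeded default_rng, and the variable names are stand-ins of mine, not the original 2400 × 60000 data. It works, but it materializes a (no_samples, no_rows, n_cols) array, which is exactly why it blows up memory at the real size:

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.random((24, 600))   # toy stand-in for the real (2400, 60000) array
no_samples = 8
no_rows = arr.shape[0]

# One (no_samples, no_rows) matrix of bootstrap row indices, drawn in a single call.
idx = rng.integers(0, no_rows, size=(no_samples, no_rows))

# arr[idx] broadcasts to shape (no_samples, no_rows, n_cols) -- the memory hog.
my_vals = arr[idx].mean(axis=1).max(axis=1)
print(my_vals.shape)  # (8,)
```

At the original shape this intermediate would be no_samples × 2400 × 60000 floats, so the approach only pays off for small inputs.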

Since advanced indexing generates a copy, the program will allocate a huge amount of memory for arr[random_idxs].

So one of the simplest ways to improve efficiency is to do things batch-wise.

BATCH = 512
max(arr[random_idxs,i:i+BATCH].mean(axis=0).max() for i in range(0,arr.shape[1],BATCH))
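A self-contained sketch of this batching idea (toy shapes, a seeded default_rng, and the BATCH value are my own choices, not anything from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.random((240, 6000))  # toy stand-in for the real (2400, 60000) array
no_rows = arr.shape[0]
BATCH = 512                    # number of columns gathered per slice

random_idxs = rng.integers(0, no_rows, size=no_rows)

# Only a (no_rows, BATCH) copy exists at any moment, instead of the full
# (no_rows, n_cols) copy that arr[random_idxs] would allocate.
val = max(
    arr[random_idxs, i:i + BATCH].mean(axis=0).max()
    for i in range(0, arr.shape[1], BATCH)
)
```

The result is identical to arr[random_idxs].mean(axis=0).max(), because the max over batch-wise maxima of the column means equals the global max.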

This is not a general solution to the problem, but it should make your specific problem much faster. Basically, the vector arr.mean(axis=0) won't change between iterations, so why not take random samples from that array?

Something like:

mean_max = arr.mean(axis=0)  # vector of column means
my_vals = np.array([np.random.choice(mean_max, size=len(mean_max), replace=True)
                    for i in range(no_samples)])

You may even be able to do: my_vals = np.random.choice(mean_max, size=(no_samples, len(mean_max)), replace=True), but I'm not sure how, if at all, that would change your statistics.
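Assuming that reading of the answer is right, the whole experiment collapses to a single vectorized call. The toy data and names below are my own; note that this changes the statistic from "max of bootstrapped column means" to "bootstrap samples of the fixed column means", which is presumably why the answer hedges about the statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.random((24, 600))   # toy stand-in for the real (2400, 60000) array
no_samples = 1000

col_means = arr.mean(axis=0)  # fixed (n_cols,) vector, computed once

# Resample the precomputed means instead of re-indexing the big array.
my_vals = rng.choice(col_means, size=(no_samples, len(col_means)), replace=True)
print(my_vals.shape)  # (1000, 600)
```

Every sampled value is drawn from col_means, so the big array is touched exactly once, regardless of no_samples.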
