More efficient way to access rows based on a list of indices in 2d numpy array?
So I have a 2D NumPy array arr. It's a relatively big one:
    arr.shape = (2400, 60000)
What I'm currently doing is the following:
    # no_samples is the number of repetitions (more than 1k wanted)
    no_rows = arr.shape[0]
    indicies = np.arange(no_rows)
    my_vals = []
    for k in range(no_samples):
        # draw no_rows row indices at random, with replacement
        random_idxs = np.random.choice(indicies, size=no_rows, replace=True)
        # take those rows, average column-wise, keep the maximum
        my_vals.append(arr[random_idxs].mean(axis=0).max())
My problem is that this is very slow. With my arr size, it takes ~3 s for one loop iteration. As I want more than 1k samples, my current solution is pretty bad (1k * ~3 s -> ~1 h). I've profiled it and the bottleneck is accessing rows based on the indices. "mean" and "max" work fast, and np.random.choice is also OK.
Do you see any area for improvement? A more efficient way of accessing the indices, or, better still, a faster approach that solves the problem without this?
What I tried so far:
    random_idxs = np.random.choice(sample_idxs, size=sample_size, replace=True)
    test = random_idxs.ravel()[arr.ravel()].reshape(arr.shape)
Since advanced indexing generates a copy, the program allocates a huge amount of memory for arr[random_idxs] on every iteration.
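To make the copy concrete, here is a minimal sketch on a small stand-in array (the shape and the float64 dtype are assumptions; the question does not state the dtype). It checks that integer-array indexing returns fresh memory while a plain slice does not, and works out the size of the temporary at the question's shape.

```python
import numpy as np

# Small stand-in array; the shape in the question is (2400, 60000).
arr = np.random.rand(24, 600)
random_idxs = np.random.choice(arr.shape[0], size=arr.shape[0], replace=True)

picked = arr[random_idxs]  # advanced (integer-array) indexing: a new array
sliced = arr[:10]          # basic slicing: a view onto arr

copies_data = not np.shares_memory(arr, picked)
slice_is_view = np.shares_memory(arr, sliced)

# Size of the temporary at the question's shape, assuming float64
# (8 bytes per element): 1_152_000_000 bytes, about 1.15 GB per iteration.
full_copy_bytes = 2400 * 60000 * 8
```

Under that float64 assumption, each loop iteration allocates and fills a temporary as large as arr itself, which is consistent with the indexing step dominating the profile.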
So one of the simplest ways to improve efficiency is to do things batch-wise.
    BATCH = 512
    max(arr[random_idxs, i:i+BATCH].mean(axis=0).max() for i in range(0, arr.shape[1], BATCH))
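As a self-contained sketch of the batch-wise idea, the one-liner can be wrapped in a function and compared against the unbatched expression. The function name bootstrap_max_batched and the small test array are illustrative, not part of the answer.

```python
import numpy as np

def bootstrap_max_batched(arr, random_idxs, batch=512):
    """Max of the column means of arr[random_idxs], one column block at a time."""
    best = -np.inf
    for start in range(0, arr.shape[1], batch):
        # Only a (len(random_idxs), batch) temporary is materialised here.
        block = arr[random_idxs, start:start + batch]
        best = max(best, block.mean(axis=0).max())
    return best

rng = np.random.default_rng(0)
arr = rng.random((240, 6000))  # small stand-in for the (2400, 60000) array
random_idxs = rng.integers(0, arr.shape[0], size=arr.shape[0])

batched = bootstrap_max_batched(arr, random_idxs)
direct = arr[random_idxs].mean(axis=0).max()
```

Both expressions compute the same value; the batched one just keeps the temporary small. The best batch size depends on the machine, so it is worth timing a few values.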
This is not a general solution to the problem, but should make your specific problem much faster. Basically, arr.mean(axis=0).max() won't change, so why not take random samples from that array?
Something like:
mean_max = arr.mean(axis=0).max()
my_vals = np.array([np.random.choice(mean_max, size=len(mean_max), replace=True) for i in range(no_samples)])
You may even be able to do:
    my_vals = np.random.choice(mean_max, size=(no_samples, len(mean_max)), replace=True)
but I'm not sure how, if at all, that would change your statistics.
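A different route, not taken from either answer above, keeps the row resampling of the original loop but never builds arr[random_idxs]. The mean over rows drawn with replacement equals a weighted sum of all rows, where each weight is how often that row was drawn, so it can be written as a matrix product. The sketch below assumes a float array; the helper name max_of_resampled_means and the chunk size are hypothetical.

```python
import numpy as np

def max_of_resampled_means(arr, idxs):
    """idxs: (n_samples, n_draws) integer array of row indices.

    Returns, per sample, the max over columns of arr[idxs[s]].mean(axis=0),
    computed from draw counts instead of a row copy.
    """
    n_rows = arr.shape[0]
    # counts[s, i] = how many times row i was drawn in sample s.
    counts = np.stack([np.bincount(row, minlength=n_rows) for row in idxs])
    # Weighted row sum divided by the number of draws = resampled column means.
    means = counts.astype(arr.dtype) @ arr / idxs.shape[1]
    return means.max(axis=1)

rng = np.random.default_rng(0)
arr = rng.random((240, 600))  # small stand-in for the (2400, 60000) array
no_samples, chunk = 100, 32   # chunking bounds the size of `means`
my_vals = np.empty(no_samples)
for start in range(0, no_samples, chunk):
    stop = min(start + chunk, no_samples)
    idxs = rng.integers(0, arr.shape[0], size=(stop - start, arr.shape[0]))
    my_vals[start:stop] = max_of_resampled_means(arr, idxs)
```

Whether this beats the batched indexing depends on the BLAS build and the chunk size, so it should be timed on the real data before relying on it.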