
Numpy mean 'inplace'

I have a line of code that looks like this:

te_succ_rate = np.mean(np.argmax(test_y, axis=1) == self.predictor(test_x))

where test_y is a numpy array of arrays and self.predictor(test_x) returns a numpy array. The whole line of code returns the percentage of subarrays in test_y whose argmax (the index of the maximum value) equals the value at the corresponding position in the array returned from self.predictor(test_x).
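As a tiny made-up illustration of what that line computes (the data and the stand-in for self.predictor(test_x) are invented for this example):

import numpy as np

test_y = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
preds = np.array([1, 0, 0])  # stand-in for self.predictor(test_x)
# per-row argmax is [1, 0, 1]; two of the three entries match preds
rate = np.mean(np.argmax(test_y, axis=1) == preds)
print(rate)  # 0.666...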

The problem is that for large sizes of test_y and test_x, it runs out of memory. It works fine for 10,000 samples, but not for 60,000.

Is there a way to avoid this?

I tried this:

tr_res = []
for start, end in zip(range(0, len(train_x), subsize), range(subsize, len(train_x), subsize)):
    tr_res.append(self.predictor(train_x[start:end]))
tr_res = np.asarray(tr_res)
tr_res = tr_res.flatten()
tr_succ_rate = np.mean(np.argmax(train_y, axis=1) == tr_res)

But it does not work, as the result is somehow 0 (which is not correct).

Level 1:

Though this isn't an answer for doing it inline, it may still be an answer to your problem:

Are you sure you're running out of memory from the mean and not the argmax?

Each additional dimension in test_y stores an extra N values of whatever datatype you're working with. Say you have 5 dimensions in your data: you'll have to store 5N values (presumably floats). The result of self.predictor(test_x) will take a 6th N of memory. The temporary array that holds the answer to your conditional is a 7th N. I don't actually know what the memory usage of np.mean is, but I assume it's not another N; for argument's sake, though, let's say it is. If you inline just np.mean, you'll only save up to one N of memory, while you already need 7N worth.
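One way to check where those Ns actually go is the nbytes attribute of each intermediate array; the shape and dtype below are invented for illustration:

import numpy as np

test_y = np.random.rand(60000, 5)        # hypothetical shape and dtype
preds = np.zeros(60000, dtype=np.int64)  # stand-in for the predictor output
labels = np.argmax(test_y, axis=1)       # an extra N of int64 indices
mask = labels == preds                   # the boolean temporary, another N
print(test_y.nbytes, labels.nbytes, mask.nbytes)  # 2400000 480000 60000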

So alternatively, try pulling your np.argmax(test_y, axis=1) out into an intermediate variable in a previous step, and don't reference test_y again after calculating the argmax, so that test_y gets garbage collected (or do whatever Python 3 does to force deletion of that variable). That should save you (number of dimensions of your data minus 1) N of memory usage: you'll be down to around 3N, or up to 4N, which is better than you could have achieved by inlining just np.mean.
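A minimal sketch of that suggestion, assuming the same context as the question (del is one way to force the reference to be dropped):

correct_ans = np.argmax(test_y, axis=1)  # compute the argmax up front
del test_y                               # drop the last reference so the big array can be collected

te_succ_rate = np.mean(correct_ans == self.predictor(test_x))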

I made the assumption that running self.predictor(test_x) only takes 1N. If it takes more, then pulling that out into its own intermediate variable in the same way will also help.

Level 2:

If that still isn't enough, still pull out your np.argmax(test_y, axis=1) and the self.predictor(test_x) into their own variables, then iterate across the two arrays yourself and do the conditional and aggregation yourself. Something like:

total = 0  # count of correct predictions ('total' avoids shadowing the builtin sum)
n = 0      # number of samples compared
correct_ans = np.argmax(test_y, axis=1)
returned_ans = self.predictor(test_x)
for c, r in zip(correct_ans, returned_ans):
    if c == r:
        total += 1
    n += 1
avg = total / n

(I'm not sure if zip is the best way to do this; np probably has a more efficient way to do the same thing, like the chunked sketch below. This is the second thing you tried, but it accumulates the aggregates without storing an additional array.) That way, you'll also save the need to store the temporary array of booleans resulting from your conditional.
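One such more efficient variant is a chunked numpy accumulation. This is only a sketch: it assumes the predictor can be called on slices, and the helper name and chunk size are invented. Slicing past the end of an array is safe in numpy, so the final partial chunk is included:

import numpy as np

def chunked_success_rate(correct_ans, predictor, test_x, chunk=10000):
    matches = 0
    for start in range(0, len(test_x), chunk):
        preds = predictor(test_x[start:start + chunk])
        matches += np.count_nonzero(correct_ans[start:start + chunk] == preds)
    return matches / len(test_x)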

If that still isn't enough, you're going to have to fundamentally change how you're storing your actual and target results, since the issue then becomes that you can't fit even just the targets and results into memory.
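I didn't name a specific storage change above, but one common option (purely an assumption here, not something from the question) is np.memmap, which keeps the arrays on disk and pages slices in on demand; the filename and shapes are invented:

import numpy as np

n, k = 60000, 10                           # hypothetical sizes
targets = np.memmap('targets.dat', dtype=np.float64, mode='w+', shape=(n, k))
# ... fill targets incrementally, then process it slice by slice:
for start in range(0, n, 10000):
    chunk = targets[start:start + 10000]   # only this slice is paged into RAM
    _ = np.argmax(chunk, axis=1)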
