How to get Mean & Std of multiple numpy saved arrays using element-wise operation

I have a folder with 1000 numpy compressed files (npz) representing the results of a data simulation. Each file has two arrays, a and b, with the same dimensions, shape and data type. What I want as a final output is the element-wise mean and standard deviation arrays of a, b and c (which I'm creating in the example below), taking into account all the simulations, i.e.:

mean_a = np.mean(a1,a2,a3,...a1000)

std_a = np.std(a1,a2,a3,...a1000), etc.

I've managed to get the mean values, but not using a direct element-wise operation. What I'm struggling with most is getting the standard deviation. I've tried to append all the arrays to lists, but I run into a MemoryError. Any idea how I should proceed? See below what I've achieved so far. Thanks in advance!

import glob
import os

import numpy as np

# outDir (the folder holding the .npz files) is assumed to be defined elsewhere
npFiles = sorted(glob.glob(os.path.join(outDir, "sc0*.npz")))

# Accumulators start at zero; np.empty would leave them uninitialised
a_accum = np.zeros([885, 854], dtype=np.float32)
b_accum = np.zeros([885, 854], dtype=np.float32)
c_accum = np.zeros([885, 854], dtype=np.float32)

for npFile in npFiles:
    npData = np.load(npFile)
    a = npData['scc']
    b = npData['bcc']
    c = a + b
    a_accum = a + a_accum
    b_accum = b + b_accum
    c_accum = c + c_accum  # was "c + b_accum", which mixed up the accumulators

# Divide by the number of files actually read
aMean = a_accum / len(npFiles)
bMean = b_accum / len(npFiles)
cMean = c_accum / len(npFiles)

Firstly, if you have (ssh) access to a machine with more memory, that's easiest. Maybe you can even manage without one: 885*854*(1000 simulations)*(4 bytes per float32) = 2.8 GiB, so if you do a, b, and c separately, you should have enough memory on a reasonable machine. In that case, just put them into an array, and use np.mean and np.std:

a = np.zeros((len(npFiles), 885, 854), dtype=np.float32)
for run, npFile in enumerate(npFiles):
    # run is the index along the first axis, npFile is the path to load
    a[run] = np.load(npFile)['scc']
amean = a.mean(axis=0)
astd = a.std(axis=0)

And similarly for b and c.
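The memory estimate above can be checked before allocating anything. This is only a minimal sketch: the helper name stacked_gib is made up for illustration, and the shape and file count are the ones quoted in the question.

```python
import numpy as np


def stacked_gib(n_files, shape, dtype=np.float32):
    """Size in GiB of one array holding n_files arrays of the given shape."""
    n_elements = n_files * int(np.prod(shape))
    return n_elements * np.dtype(dtype).itemsize / 2**30


# 1000 files of 885 x 854 float32 values, for a single variable (a, b or c)
print(f"{stacked_gib(1000, (885, 854)):.1f} GiB")  # prints "2.8 GiB"
```

Holding a, b and c at the same time would need roughly three times that, which is why handling them one after the other helps.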

Otherwise, the most elegant option is to save the data in a format that can easily be lazily loaded. dask was specifically designed for this, but can take some time to learn (might be worth it in the long run though). You can also store it in netCDF files and use xarray as a sort-of frontend for dask, which may be even more convenient.

If you only need the mean and std, you can do it manually. The formula for std is

std = sqrt(mean(abs(x - x.mean())**2))
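As a quick sanity check, this formula can be compared with np.std on a small random stack. The sketch below uses made-up data (the shape and seed are arbitrary). It also shows that dividing by N - 1 instead of N corresponds to ddof=1 in np.std.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 4, 3))  # 50 "simulations" of a 4 x 3 array

# Population form: divide by N (np.std default, ddof=0)
std_manual = np.sqrt(np.mean(np.abs(x - x.mean(axis=0)) ** 2, axis=0))

# Sample form: divide by N - 1 (ddof=1)
n = x.shape[0]
std_sample = np.sqrt(np.sum((x - x.mean(axis=0)) ** 2, axis=0) / (n - 1))

print(np.allclose(std_manual, x.std(axis=0)))
print(np.allclose(std_sample, x.std(axis=0, ddof=1)))
```

Both comparisons should print True. With 1000 simulations the two forms differ only slightly.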

Since you already have the means, the procedure will work very similarly to what you already did (untested):

import glob
import os

import numpy as np

# outDir (the folder holding the .npz files) is assumed to be defined elsewhere
npFiles = sorted(glob.glob(os.path.join(outDir, "sc0*.npz")))
n_files = len(npFiles)

# First pass: accumulate element-wise sums (start from zeros, not np.empty)
a_accum = np.zeros([885, 854], dtype=np.float32)
b_accum = np.zeros([885, 854], dtype=np.float32)
c_accum = np.zeros([885, 854], dtype=np.float32)

for npFile in npFiles:
    npData = np.load(npFile)
    a = npData['scc']
    b = npData['bcc']
    c = a + b
    a_accum += a
    b_accum += b
    c_accum += c  # was "c + b_accum", which mixed up the accumulators

# Divide by the number of files actually read
aMean = a_accum / n_files
bMean = b_accum / n_files
cMean = c_accum / n_files

# Second pass: accumulate squared deviations from the means
a_sumsq = np.zeros([885, 854], dtype=np.float32)
b_sumsq = np.zeros([885, 854], dtype=np.float32)
c_sumsq = np.zeros([885, 854], dtype=np.float32)

for npFile in npFiles:
    npData = np.load(npFile)
    a = npData['scc']
    b = npData['bcc']
    c = a + b
    a_sumsq += (a - aMean)**2
    b_sumsq += (b - bMean)**2
    c_sumsq += (c - cMean)**2

# The -1 gives the sample standard deviation (ddof=1);
# divide by n_files instead to match the default of np.std
a_std = np.sqrt(a_sumsq / (n_files - 1))
b_std = np.sqrt(b_sumsq / (n_files - 1))
c_std = np.sqrt(c_sumsq / (n_files - 1))
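
If reading every file twice is too slow, the mean and std can also be accumulated in a single pass. The sketch below swaps in Welford's online algorithm, which is not part of the answer above. It writes a few small synthetic .npz files to a temporary folder and checks the result against np.mean and np.std on the stacked data. The file names, the array shapes and the helper name streaming_mean_std are all made up for illustration.

```python
import glob
import os
import tempfile

import numpy as np


def streaming_mean_std(files, key, ddof=1):
    """Element-wise mean and std over files[i][key], reading each file once."""
    count = 0
    mean = None
    m2 = None  # running sum of squared deviations from the current mean
    for path in files:
        with np.load(path) as data:
            x = data[key].astype(np.float64)
        if mean is None:
            mean = np.zeros_like(x)
            m2 = np.zeros_like(x)
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # second factor uses the updated mean
    return mean, np.sqrt(m2 / (count - ddof))


# Synthetic stand-in for the simulation output: 20 small files
rng = np.random.default_rng(0)
tmp_dir = tempfile.mkdtemp()
for i in range(20):
    np.savez_compressed(
        os.path.join(tmp_dir, f"sc{i:04d}.npz"),
        scc=rng.normal(size=(5, 4)).astype(np.float32),
        bcc=rng.normal(size=(5, 4)).astype(np.float32),
    )
files = sorted(glob.glob(os.path.join(tmp_dir, "sc0*.npz")))

a_mean, a_std = streaming_mean_std(files, "scc")

# Reference: load everything at once (fine here, the files are tiny)
stack = np.stack([np.load(f)["scc"] for f in files]).astype(np.float64)
print(np.allclose(a_mean, stack.mean(axis=0)))
print(np.allclose(a_std, stack.std(axis=0, ddof=1)))
```

Only the running mean and the running sum of squared deviations stay in memory, so the footprint does not grow with the number of files. For c, the same update could be applied to the sum of the two arrays read from each file.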
