How to use numpy to calculate mean and standard deviation of an irregular shaped array

I have a numpy array that contains many samples of varying length:

Samples = np.array([[1001, 1002, 1003],
                    ... ,
                    [1001, 1002]])

I want to (elementwise) subtract the mean of the array and then divide by the standard deviation of the array. Something like:

newSamples = (Samples-np.mean(Samples))/np.std(Samples)

Except that doesn't work for irregularly shaped arrays; np.mean(Samples) raises

unsupported operand type(s) for /: 'list' and 'int'

I assume this happens because numpy sets a static size for each axis and cannot handle a sample of a different size when it encounters one. What is an approach to solve this using numpy?

Example input:

Sample = np.array([[1, 2, 3],
                   [1, 2]])

After subtracting the mean and then dividing by the standard deviation:

Sample = array([[-1.06904497,  0.26726124,  1.60356745], 
                [-1.06904497,  0.26726124]])

Don't make ragged arrays. Just don't. Numpy can't do much with them, and any code you write for them will always be unreliable and slow, because numpy doesn't work that way. It turns them into object dtypes:

Sample
array([[1, 2, 3], [1, 2]], dtype=object)

Which almost no numpy functions work on. In this case those objects are list objects, which makes your code even more confusing, as you either have to switch between list and ndarray methods or stick to list-safe numpy methods. This is a recipe for disaster, as anyone noodling around with the code later (even yourself, if you forget) will be dancing in a minefield.
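To see the object-dtype behaviour concretely, here is a small sketch. It assumes a recent NumPy release, where a ragged nested list is only accepted when dtype=object is requested explicitly (older releases performed this conversion silently, as in the output above). The variable name ragged is just for illustration.

```python
import numpy as np

# Assumption: a recent NumPy release, where ragged input needs an explicit dtype=object.
ragged = np.array([[1, 2, 3], [1, 2]], dtype=object)

print(ragged.dtype)     # object
print(ragged.shape)     # (2,): a 1D array holding two Python lists
print(type(ragged[0]))  # <class 'list'>
```

Because the elements are plain Python lists, arithmetic on the array falls back to list semantics rather than numeric broadcasting.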

There are two things you can do with your data to make things work better:

The first method is to index and flatten.

i = np.cumsum(np.array([len(x) for x in Sample]))
flat_sample = np.hstack(Sample)

This preserves the end index of each sample in i, while keeping the samples as a single 1D array.

The other method is to pad one dimension with np.nan and use nan-safe functions:

m = np.array([len(x) for x in Sample]).max()
nan_sample = np.array([x + [np.nan] * (m - len(x)) for x in Sample])

So to do your calculations, you can use flat_sample and proceed much as above:

new_flat_sample = (flat_sample - np.mean(flat_sample)) / np.std(flat_sample) 

and use i to recreate your original array (or a list of arrays, which I recommend; see np.split).

new_list_sample = np.split(new_flat_sample, i[:-1])

[array([-1.06904497,  0.26726124,  1.60356745]),
 array([-1.06904497,  0.26726124])]
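The flatten, normalize, and split steps can also be collected into one helper. The following is a minimal sketch, not a definitive implementation: the function name normalize_ragged is hypothetical, it takes a plain Python list of lists rather than an object array, and it assumes at least one non-empty sample.

```python
import numpy as np

def normalize_ragged(samples):
    """Standardize all values with the global mean/std, keeping each sample's length."""
    # End index of every sample inside the flattened array.
    ends = np.cumsum([len(x) for x in samples])
    # Concatenate everything into one regular 1D float array.
    flat = np.concatenate([np.asarray(x, dtype=float) for x in samples])
    scaled = (flat - flat.mean()) / flat.std()
    # Split at every boundary except the last one, which is the end of the array.
    return np.split(scaled, ends[:-1])

result = normalize_ragged([[1, 2, 3], [1, 2]])
for row in result:
    print(row)
```

Returning a list of 1D arrays avoids recreating a ragged object array, so every element stays a regular numeric ndarray.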

Or use nan_sample, but you will need to replace np.mean and np.std with np.nanmean and np.nanstd:

new_nan_sample = (nan_sample - np.nanmean(nan_sample)) / np.nanstd(nan_sample)

array([[-1.06904497,  0.26726124,  1.60356745],
       [-1.06904497,  0.26726124,         nan]])
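As a variation on the padding step, the sketch below fills a preallocated NaN array instead of concatenating lists, so it also accepts samples that are already arrays. The helper name pad_with_nan is hypothetical, and the sketch assumes numeric samples with at least one value overall.

```python
import numpy as np

def pad_with_nan(samples):
    """Return a 2D float array where short samples are right-padded with NaN."""
    width = max(len(x) for x in samples)
    padded = np.full((len(samples), width), np.nan)
    for row, x in enumerate(samples):
        padded[row, :len(x)] = x
    return padded

nan_sample = pad_with_nan([[1, 2, 3], [1, 2]])
# nanmean/nanstd ignore the padding, so the statistics match the flattened data.
scaled = (nan_sample - np.nanmean(nan_sample)) / np.nanstd(nan_sample)
# Boolean mask of real (non-padding) entries.
valid = ~np.isnan(scaled)
print(scaled[valid])
```

The padded form keeps a regular 2D shape, at the cost of having to remember that NaN marks padding in every later computation.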

@MichaelHackman (following the comment remark). That's weird, because when I compute the overall std and mean and then apply them, I obtain a different result (see code below).

import numpy as np

# dtype=object is required for ragged input on recent NumPy versions
Samples = np.array([[1, 2, 3],
                    [1, 2]], dtype=object)
c = np.hstack(Samples)  # gives [1, 2, 3, 1, 2]
mean, std = np.mean(c), np.std(c)
newSamples = np.asarray([(np.array(xi) - mean) / std for xi in Samples], dtype=object)
print(newSamples)
# rows: [-1.06904497, 0.26726124, 1.60356745] and [-1.06904497, 0.26726124]

Edit: added np.asarray() and moved the mean and std computation outside the loop, following Imanol Luengo's excellent comments (thanks!)
