Average of column values based on intervals of a second column

I have a dataset with two columns: column 1 is the time, which runs from 1 to 9 seconds, and column 2 is the probability of an event in each second, with values 30, 69, 56, 70, 90, 59, 87, 10, 20.

I am trying to get the average probability over a time interval that starts at 2 seconds: the average between 2 and 3 seconds, 2 and 4 seconds, 2 and 5 seconds, ..., 2 and 9 seconds.

I tried the following approach, where I defined an array t_inc holding the upper bounds of the intervals (3, 4, ..., 9, i.e. 2 plus increments of 1). However, I get the following error message (from P_slice_avg_1 in the code):

Operands could not be broadcast together with shapes (9,) (7,)

because my t_inc has shape (7,), while t has shape (9,).

When I do it manually (P_slice_avg_2 in the code) it works, but that is not feasible for a large number of intervals.

Any help on how to generalize it would be greatly appreciated.

import numpy as np
data=np.loadtxt('C:/Users/Hrihaan/Desktop/Sample.txt')

t=data[:,0] # t goes from 1 to 9
P=data[:,1] # probability of an event in a specific second

i= np.arange(1, 8 , 1)
t_inc= 2 + i 

P_slice_avg_1= np.mean(P[(t>=2) & (t<=t_inc)]) # I thought this would give me the averages between 2 and values of t_inc

P_slice_avg_2= np.mean(P[(t>=2) & (t<=3)]), np.mean(P[(t>=2) & (t<=4)]), np.mean(P[(t>=2) & (t<=5)]), np.mean(P[(t>=2) & (t<=6)]), np.mean(P[(t>=2) & (t<=7)]), np.mean(P[(t>=2) & (t<=8)]), np.mean(P[(t>=2) & (t<=9)])

Here is a vectorized approach exploiting NumPy broadcasting:

import numpy as np
t = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]) 
P = np.array([30, 69, 56, 70, 90, 59, 87, 10, 20], dtype=float) 
i = np.arange(1, 8 , 1)
t_inc= 2 + i 

T = np.tile(t[:,None], len(i))  # shape (9, 7): one copy of t per interval
P = np.tile(P[:,None], len(i))  # shape (9, 7): one copy of P per interval

np.tile constructs an array by repeating its input a given number of times. In this case we get len(i) copies of t and of P, placed side by side as columns, namely:

P
array([[30., 30., 30., 30., 30., 30., 30.],
       [69., 69., 69., 69., 69., 69., 69.],
       [56., 56., 56., 56., 56., 56., 56.],
       [70., 70., 70., 70., 70., 70., 70.],
       [90., 90., 90., 90., 90., 90., 90.],
       [59., 59., 59., 59., 59., 59., 59.],
       [87., 87., 87., 87., 87., 87., 87.],
       [10., 10., 10., 10., 10., 10., 10.],
       [20., 20., 20., 20., 20., 20., 20.]])
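To see where the (9, 7) shape comes from, here is a small sketch that assumes the same t as above. It only checks the shape of the tiled array and that each column is a copy of t; np.broadcast_to is used purely as a cross-check and is not part of the approach described here.

```python
import numpy as np

t = np.arange(1, 10)            # 1..9, shape (9,)
col = t[:, None]                # add a trailing axis: shape (9, 1)
T = np.tile(col, 7)             # repeat the column 7 times along axis 1

print(col.shape, T.shape)       # (9, 1) (9, 7)

# Every column of T is a copy of t, which is also what broadcasting
# a (9, 1) array to (9, 7) would produce.
same = np.array_equal(T, np.broadcast_to(col, (9, 7)))
print(same)
```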

Now we use np.logical_or to set to zero all the elements that do not satisfy the required condition, i.e. those whose time is below 2 or above the interval's upper bound:

P[np.logical_or(2>T, T>t_inc)]=0
P
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [69., 69., 69., 69., 69., 69., 69.],
       [56., 56., 56., 56., 56., 56., 56.],
       [ 0., 70., 70., 70., 70., 70., 70.],
       [ 0.,  0., 90., 90., 90., 90., 90.],
       [ 0.,  0.,  0., 59., 59., 59., 59.],
       [ 0.,  0.,  0.,  0., 87., 87., 87.],
       [ 0.,  0.,  0.,  0.,  0., 10., 10.],
       [ 0.,  0.,  0.,  0.,  0.,  0., 20.]])
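The masking step works because a (9, 7) array can be compared against a (7,) array, whereas the (9,) against (7,) comparison from the question cannot be broadcast. The sketch below illustrates the difference under the same t and t_inc; the name inside is only an illustrative label for the boolean mask.

```python
import numpy as np

t = np.arange(1, 10)                 # shape (9,)
t_inc = 2 + np.arange(1, 8)          # upper bounds 3..9, shape (7,)

# Comparing shapes (9,) and (7,) fails, as reported in the question.
raised = False
try:
    t <= t_inc
except ValueError as err:
    raised = True
    print(err)

# With a trailing axis, (9, 1) broadcasts against (7,) to give (9, 7).
inside = (t[:, None] >= 2) & (t[:, None] <= t_inc)
print(inside.shape)                  # (9, 7)
print(inside.sum(axis=0))            # [2 3 4 5 6 7 8] samples per interval
```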

In this way each column stores exactly the elements to average. However, np.mean would give the wrong result, because its denominator would be P.shape[0], i.e. it would also count the zeroed elements. As a workaround, we can sum along axis 0 and divide by the number of non-zero elements, obtained with np.count_nonzero (this assumes that none of the selected probabilities is itself zero):

np.sum(P, axis=0)/np.count_nonzero(P, axis=0)
array([62.5, 65., 71.25, 68.8, 71.83333333, 63., 57.625])
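If the data may contain genuine zeros, counting non-zero entries would undercount the denominator. One possible variant, sketched here as an alternative rather than as the method above, keeps the boolean mask and divides by the number of selected samples instead. The names mask, avg and loop are illustrative, and the explicit loop is included only as a cross-check.

```python
import numpy as np

t = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
P = np.array([30, 69, 56, 70, 90, 59, 87, 10, 20], dtype=float)
t_inc = 2 + np.arange(1, 8)          # upper bounds 3..9

# mask[r, c] is True when t[r] lies in the interval [2, t_inc[c]]
mask = (t[:, None] >= 2) & (t[:, None] <= t_inc)

# Sum the selected values per interval, then divide by how many were selected.
avg = (P[:, None] * mask).sum(axis=0) / mask.sum(axis=0)

# Cross-check against the straightforward per-interval loop.
loop = np.array([P[(t >= 2) & (t <= hi)].mean() for hi in t_inc])
print(np.allclose(avg, loop))
```

Note that an interval containing no samples would give a division by zero (NaN with a warning) in this sketch, so empty intervals would need separate handling.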
