
Apply a user-defined function element-wise to arrays of any dimension

I have the following function, which removes the outliers from an array of numbers:

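import numpy as np

def reject_outliers(data, m = 1):
    d = np.abs(data - np.median(data))   # absolute deviation from the median
    mdev = np.median(d)                  # median absolute deviation
    s = d/mdev if mdev else 0            # scaled deviation (0 when mdev is 0)
    return np.array(data)[s<m]           # keep only values with s below m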

vec_reject_outliers = np.vectorize(reject_outliers)

I would like to apply this function element-wise across multiple multidimensional arrays. In the following example I try to find the mean of each element across the 3 nested arrays in the list:

a = np.array([np.array([4000, np.array([12,10])], dtype=object), np.array([50, np.array([13, 11])], dtype=object), np.array([51,np.array([30,20])], dtype=object)])

result = [vec_reject_outliers(l, m = 1).mean(axis = 0) for l in zip(*a)]

The result should be [51, array([13, 11])] since certain numbers will be treated as outliers and thus removed from the calculation. Yet the result that I get is [1367.0, array([18.33333333, 13.66666667])] which is the mean calculated without omitting any element.

Is there a way to apply reject_outliers element-wise in such scenarios, or any other way to get the expected result for arrays of any dimension?

Add a print to your function to see exactly what data gets sent to it:

In [65]: def reject_outliers(data, m = 1):
    ...:     print('data',data)
    ...:     d = np.abs(data - np.median(data))
    ...:     mdev = np.median(d)
    ...:     s = d/mdev if mdev else 0
    ...:     return np.array(data)[s<m]
    ...: 
In [66]: vec_reject_outliers = np.vectorize(reject_outliers)
In [67]: result = [vec_reject_outliers(l, m = 1).mean(axis = 0) for l in zip(*a)]
data 4000              # trial run to determine otypes
data 4000
data 50
data 51
data 12           # trial run for 2nd tuple
data 12
data 10
data 13
data 11
data 30
data 20

So each call to reject_outliers receives a single element. That's what you asked for, right?

"apply this function element-wise in multiple multidimensional arrays"
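A single element has nothing to be compared against, which is why nothing gets rejected. The sketch below repeats the function from the question (without the print) and calls it once on a scalar and once on the whole group. The names single and group are only illustrative.

```python
import numpy as np

def reject_outliers(data, m=1):
    # Same logic as the function in the question.
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / mdev if mdev else 0
    return np.array(data)[s < m]

# One scalar: its median is itself, so d and mdev are 0 and the value is kept.
single = reject_outliers(4000, m=1)
print(single)

# The three values together: the median is 51, so 4000 and 50 are rejected.
group = reject_outliers(np.array([4000, 50, 51]), m=1)
print(group)
```

So the function has to see all the values that belong together in one call, not one value at a time.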

But let's look at your array:

In [68]: a
Out[68]: 
array([[4000, array([12, 10])],
       [50, array([13, 11])],
       [51, array([30, 20])]], dtype=object)
In [69]: a.shape
Out[69]: (3, 2)
In [70]: a[:,0]
Out[70]: array([4000, 50, 51], dtype=object)
In [71]: a[:,1]
Out[71]: array([array([12, 10]), array([13, 11]), array([30, 20])], dtype=object)

And the l values passed to vec_reject_outliers are two tuples:

In [77]: list(zip(*a))
Out[77]: [(4000, 50, 51), (array([12, 10]), array([13, 11]), array([30, 20]))]

So vec_reject_outliers is called twice. Each time it converts its tuple argument into an array, giving these 2 arrays:

In [82]: np.array(Out[77][0])
Out[82]: array([4000,   50,   51])
In [83]: np.array(Out[77][1])
Out[83]: 
array([[12, 10],
       [13, 11],
       [30, 20]])

In turn it calls reject_outliers with each element of Out[82], and then with each element of Out[83].
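The same behaviour can be checked without the print, using a small recording function. This is only a sketch: record and seen are made-up names, and otypes is passed so that np.vectorize skips the extra trial call seen in the output above.

```python
import numpy as np

seen = []

def record(x):
    # Hypothetical helper: remember every value np.vectorize hands over.
    seen.append(x)
    return x

# Supplying otypes means no trial run is needed to guess the output type.
vec_record = np.vectorize(record, otypes=[float])

cols = (np.array([12, 10]), np.array([13, 11]), np.array([30, 20]))
out = vec_record(cols)

print(out.shape)  # the tuple was turned into one 2-D array
print(seen)       # record() was called once per scalar element
```

The tuple of three arrays becomes a single (3, 2) array, and the wrapped function only ever sees one scalar at a time.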

I haven't tried to figure out exactly what reject_outliers does, or what it should be passed, but apparently np.vectorize is not doing what you want. I'd suggest dropping it and doing an explicit iteration. That way you have more control.


Using reject_outliers directly in the comprehension (the print here evidently also shows m):

In [84]: result = [reject_outliers(l, m = 1).mean(axis = 0) for l in zip(*a)]
data (4000, 50, 51) 1
data (array([12, 10]), array([13, 11]), array([30, 20])) 1
In [85]: result
Out[85]: [51.0, 12.0]

Applying reject_outliers separately to the 2nd column of a:

In [87]: result = [reject_outliers(l, m = 1).mean(axis = 0) for l in zip(*a[:,1])]
data (12, 13, 30) 1
data (10, 11, 20) 1
In [88]: result
Out[88]: [13.0, 11.0]

And using the 1st column of a directly:

In [91]: reject_outliers(a[:,0], m = 1).mean(axis = 0)
data [4000 50 51] 1
Out[91]: 51.0
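Putting those pieces together, here is one possible way to get the result the question expects. It is a sketch, not a general solution: robust_mean is a hypothetical helper, and it assumes every item in a column has the same shape so they can be stacked. It stacks the items along a new first axis and applies reject_outliers to each position across that axis.

```python
import numpy as np

def reject_outliers(data, m=1):
    # Same logic as the function in the question.
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / mdev if mdev else 0
    return np.array(data)[s < m]

def robust_mean(items, m=1):
    # Hypothetical helper: stack equally shaped items (scalars or arrays)
    # along a new axis 0, then take the outlier-free mean of each position.
    stacked = np.stack([np.asarray(x, dtype=float) for x in items])
    return np.apply_along_axis(
        lambda v: reject_outliers(v, m).mean(), 0, stacked)

rows = [
    [4000, np.array([12, 10])],
    [50, np.array([13, 11])],
    [51, np.array([30, 20])],
]

# zip(*rows) groups the first items together, then the second items.
result = [robust_mean(col) for col in zip(*rows)]
print(result)
```

For the data above this gives 51 for the first column and [13, 11] for the second. np.apply_along_axis is still a Python-level loop, so it is convenient rather than fast.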


 