简体   繁体   中英

How to iterate through numpy array and remove anomalies?

I am a beginner with Python and programming in general. I am trying to write a program that iterates through a specific numpy array, and detects anomalies within the dataset (the definition of an anomaly is any point that is greater than 3 times the standard deviation from the mean WITHOUT the data point). I need to recalculate the mean and standard deviation for each time an anomalous data point is removed.

I have written the below code, but noticed a couple of issues. After the loop is iterated through once, it states that the value of 160 is removed, but when I print new_array, I still see 160 in the array.

Also, how could I recalculate the new mean for each time a data point is removed? I feel like something is just positioned incorrectly within the for loop. And finally is my use of continue correct or should it be placed elsewhere?

import numpy as np

data_array = np.array([
    99.5697438 ,  94.47019021,  55., 106.86672855,
   102.78730151, 131.85777845,  88.25376895,  96.94439838,
    83.67782174, 115.57993209, 118.97651966,  94.40479467,
    79.63342207,  77.88602065,  96.59145004,  99.50145353,
    97.25980235,  87.72010069, 101.30597215,  87.3110369 ,
   110.0687946 , 104.71504012,  89.34719772, 160.,
   110.61519268, 112.94716398, 104.41867586])

for cell in data_array:
    mean = np.mean(data_array, axis=0)
    sd = np.std(data_array, axis=0)
    lower_anomaly_point = mean - (3 * sd)
    upper_anomaly_point = mean + (3 * sd)
    if cell > upper_anomaly_point or cell < lower_anomaly_point:
        print(str(cell) + 'has been removed.')
        new_array = np.delete(data_array, cell)
        continue 

I think you should see Numpy Documentation and refer to the first line where they specifically say that it returns all the elements that don't conform with arr[obj], this means that numpy.delete() works in an index based manner. I would suggest you edit your code so as to get the index of that cell and then pass it onto np.delete()

Following is the edited code:

import numpy as np

data_array = np.array([99.5697438, 94.47019021, 55.0, 106.86672855, 102.78730151, 131.85777845, 88.25376895, 96.94439838, 83.67782174, 115.57993209, 118.97651966, 94.40479467, 79.63342207, 77.88602065, 96.59145004, 99.50145353, 97.25980235, 87.72010069, 101.30597215, 87.3110369, 110.0687946, 104.71504012, 89.34719772, 160.0, 110.61519268, 112.94716398, 104.41867586])
print(data_array)
for cell in data_array:
    mean = np.mean(data_array, axis=0)
    sd = np.std(data_array, axis=0)
    lower_anomaly_point = mean - (3 * sd)
    upper_anomaly_point = mean + (3 * sd)
    if cell > upper_anomaly_point or cell < lower_anomaly_point:
        print(str(cell) + 'has been removed.')
        index=np.where(data_array==cell)
        new_array = np.delete(data_array, obj=index)
        continue 

As @damagedcoda say your main error is you should use index instead the value, but you will have new problem if you will recalculate the lower_anomaly_point and upper_anomaly_point inside cycle. So i recommend you to try the np.where to solve your task:

import numpy as np

data_array = np.array([
    99.5697438 ,  94.47019021,  55., 106.86672855,
   102.78730151, 131.85777845,  88.25376895,  96.94439838,
    83.67782174, 115.57993209, 118.97651966,  94.40479467,
    79.63342207,  77.88602065,  96.59145004,  99.50145353,
    97.25980235,  87.72010069, 101.30597215,  87.3110369 ,
   110.0687946 , 104.71504012,  89.34719772, 160.,
   110.61519268, 112.94716398, 104.41867586])

mean = np.mean(data_array, axis=0)
sd = np.std(data_array, axis=0)
lower_anomaly_point = mean - (3 * sd)
upper_anomaly_point = mean + (3 * sd)

data_array = data_array[
    np.where(
        (upper_anomaly_point > data_array) & (data_array > lower_anomaly_point)
    )]

and result is:

array([ 99.5697438 ,  94.47019021,  55.        , 106.86672855,
       102.78730151, 131.85777845,  88.25376895,  96.94439838,
        83.67782174, 115.57993209, 118.97651966,  94.40479467,
        79.63342207,  77.88602065,  96.59145004,  99.50145353,
        97.25980235,  87.72010069, 101.30597215,  87.3110369 ,
       110.0687946 , 104.71504012,  89.34719772, 110.61519268,
       112.94716398, 104.41867586])

That code fails for me. The data_array does not change, np.delete returns new array, it does not change old one. You do not use new_array in any place of the code, you probably wanted to calculated mean from new_array The second argument for delete should be index, "indicates which subarray to remove". you cannot use cell.

import numpy as np

data_array = np.array([
    99.5697438 ,  94.47019021,  55., 106.86672855,
   102.78730151, 131.85777845,  88.25376895,  96.94439838,
    83.67782174, 115.57993209, 118.97651966,  94.40479467,
    79.63342207,  77.88602065,  96.59145004,  99.50145353,
    97.25980235,  87.72010069, 101.30597215,  87.3110369 ,
   110.0687946 , 104.71504012,  89.34719772, 160.,
   110.61519268, 112.94716398, 104.41867586])

mean = np.mean(data_array, axis=0)
sd = np.std(data_array, axis=0)
lower_anomaly_point = mean - (3 * sd)
upper_anomaly_point = mean + (3 * sd)
new_array = data_array.copy()
k = 0

for i, cell in enumerate(data_array):
    if cell > upper_anomaly_point or cell < lower_anomaly_point:
        print(str(cell) + 'has been removed.')
        new_array = np.delete(new_array, i - k)
        k += 1

new_array is data_array without 160. as you wished

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM