I am a beginner with Python and programming in general. I am trying to write a program that iterates through a specific NumPy array and detects anomalies within the dataset (an anomaly is any point that lies more than 3 standard deviations away from the mean WITHOUT the data point). I need to recalculate the mean and standard deviation each time an anomalous data point is removed.
I have written the code below, but noticed a couple of issues. After the loop has run through once, it reports that the value 160 has been removed, but when I print new_array, I still see 160 in the array.
Also, how could I recalculate the mean each time a data point is removed? I feel like something is just positioned incorrectly within the for loop. And finally, is my use of continue correct, or should it be placed elsewhere?
import numpy as np
data_array = np.array([
99.5697438 , 94.47019021, 55., 106.86672855,
102.78730151, 131.85777845, 88.25376895, 96.94439838,
83.67782174, 115.57993209, 118.97651966, 94.40479467,
79.63342207, 77.88602065, 96.59145004, 99.50145353,
97.25980235, 87.72010069, 101.30597215, 87.3110369 ,
110.0687946 , 104.71504012, 89.34719772, 160.,
110.61519268, 112.94716398, 104.41867586])
for cell in data_array:
    mean = np.mean(data_array, axis=0)
    sd = np.std(data_array, axis=0)
    lower_anomaly_point = mean - (3 * sd)
    upper_anomaly_point = mean + (3 * sd)
    if cell > upper_anomaly_point or cell < lower_anomaly_point:
        print(str(cell) + 'has been removed.')
        new_array = np.delete(data_array, cell)
        continue
I think you should look at the NumPy documentation for numpy.delete(). Its opening lines say that, for a one-dimensional array, it returns those entries not returned by arr[obj]. This means that numpy.delete() works on indices, not on values. I would suggest editing your code to look up the index of that cell and then pass the index to np.delete().
Following is the edited code:
import numpy as np
data_array = np.array([99.5697438, 94.47019021, 55.0, 106.86672855, 102.78730151, 131.85777845, 88.25376895, 96.94439838, 83.67782174, 115.57993209, 118.97651966, 94.40479467, 79.63342207, 77.88602065, 96.59145004, 99.50145353, 97.25980235, 87.72010069, 101.30597215, 87.3110369, 110.0687946, 104.71504012, 89.34719772, 160.0, 110.61519268, 112.94716398, 104.41867586])
print(data_array)
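# Work on a copy so that every removal is kept, not only the last one
new_array = data_array.copy()
for cell in data_array:
    # Recalculate the statistics from the points that are still left
    mean = np.mean(new_array, axis=0)
    sd = np.std(new_array, axis=0)
    lower_anomaly_point = mean - (3 * sd)
    upper_anomaly_point = mean + (3 * sd)
    if cell > upper_anomaly_point or cell < lower_anomaly_point:
        print(str(cell) + ' has been removed.')
        # np.delete expects indices, so look up where the value sits
        index = np.where(new_array == cell)[0]
        new_array = np.delete(new_array, obj=index)
        # no continue needed: the loop moves on to the next cell by itself
print(new_array)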
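To see the index-based behaviour in isolation, here is a small sketch on a made-up four-element array (the array and the variable names are only for illustration and are not part of the original data):

```python
import numpy as np

# Toy array, made up for this illustration
arr = np.array([10.0, 20.0, 30.0, 40.0])

# The second argument is a position: this drops the element at index 1
without_index_1 = np.delete(arr, 1)

# To drop a particular value, look up its position(s) first
positions = np.where(arr == 30.0)[0]
without_30 = np.delete(arr, positions)

print(without_index_1.tolist())  # [10.0, 30.0, 40.0]
print(without_30.tolist())       # [10.0, 20.0, 40.0]
print(arr.tolist())              # [10.0, 20.0, 30.0, 40.0], the original is untouched
```

Note that np.delete always hands back a new array, so the result has to be assigned to a name and used afterwards.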
As @damagedcoda says, your main error is that you should use the index instead of the value, but you will run into a new problem if you recalculate lower_anomaly_point and upper_anomaly_point inside the loop. So I recommend trying np.where to solve your task:
import numpy as np
data_array = np.array([
99.5697438 , 94.47019021, 55., 106.86672855,
102.78730151, 131.85777845, 88.25376895, 96.94439838,
83.67782174, 115.57993209, 118.97651966, 94.40479467,
79.63342207, 77.88602065, 96.59145004, 99.50145353,
97.25980235, 87.72010069, 101.30597215, 87.3110369 ,
110.0687946 , 104.71504012, 89.34719772, 160.,
110.61519268, 112.94716398, 104.41867586])
mean = np.mean(data_array, axis=0)
sd = np.std(data_array, axis=0)
lower_anomaly_point = mean - (3 * sd)
upper_anomaly_point = mean + (3 * sd)
data_array = data_array[
    np.where(
        (upper_anomaly_point > data_array) & (data_array > lower_anomaly_point)
    )
]
And the result is:
array([ 99.5697438 , 94.47019021, 55. , 106.86672855,
102.78730151, 131.85777845, 88.25376895, 96.94439838,
83.67782174, 115.57993209, 118.97651966, 94.40479467,
79.63342207, 77.88602065, 96.59145004, 99.50145353,
97.25980235, 87.72010069, 101.30597215, 87.3110369 ,
110.0687946 , 104.71504012, 89.34719772, 110.61519268,
112.94716398, 104.41867586])
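If the bounds do need to be recalculated after each removal, as the question asks, one possible sketch is to repeat the same filter until a pass removes nothing. This is an assumption about what is wanted, not the only reading: it drops every point outside the bounds in each pass rather than one point at a time, it uses a plain boolean mask instead of np.where, it keeps points that sit exactly on a bound, and the helper name remove_anomalies is made up for this example.

```python
import numpy as np

def remove_anomalies(values, n_sd=3):
    """Repeatedly drop points more than n_sd standard deviations from the mean.

    Hypothetical helper for illustration; the mean and standard deviation
    are recomputed from the surviving points before every pass.
    """
    values = np.asarray(values, dtype=float)
    while values.size > 0:
        mean = np.mean(values)
        sd = np.std(values)  # population standard deviation (ddof=0), as above
        keep = np.abs(values - mean) <= n_sd * sd
        if keep.all():
            break  # nothing was outside the bounds, so we are done
        values = values[keep]
    return values

data_array = np.array([
    99.5697438, 94.47019021, 55., 106.86672855,
    102.78730151, 131.85777845, 88.25376895, 96.94439838,
    83.67782174, 115.57993209, 118.97651966, 94.40479467,
    79.63342207, 77.88602065, 96.59145004, 99.50145353,
    97.25980235, 87.72010069, 101.30597215, 87.3110369,
    110.0687946, 104.71504012, 89.34719772, 160.,
    110.61519268, 112.94716398, 104.41867586])

cleaned = remove_anomalies(data_array)
print(len(data_array), len(cleaned))  # 27 26
```

For this dataset the second pass finds nothing further to remove, so the output matches the single-pass result shown above.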
That code fails for me. data_array does not change: np.delete returns a new array, it does not modify the old one. You do not use new_array anywhere in the code; you probably wanted to calculate the mean from new_array. The second argument of delete should be an index ("indicates which subarray to remove"), so you cannot pass cell.
import numpy as np
data_array = np.array([
99.5697438 , 94.47019021, 55., 106.86672855,
102.78730151, 131.85777845, 88.25376895, 96.94439838,
83.67782174, 115.57993209, 118.97651966, 94.40479467,
79.63342207, 77.88602065, 96.59145004, 99.50145353,
97.25980235, 87.72010069, 101.30597215, 87.3110369 ,
110.0687946 , 104.71504012, 89.34719772, 160.,
110.61519268, 112.94716398, 104.41867586])
mean = np.mean(data_array, axis=0)
sd = np.std(data_array, axis=0)
lower_anomaly_point = mean - (3 * sd)
upper_anomaly_point = mean + (3 * sd)
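new_array = data_array.copy()
k = 0  # number of elements already deleted from new_array
for i, cell in enumerate(data_array):
    if cell > upper_anomaly_point or cell < lower_anomaly_point:
        print(str(cell) + ' has been removed.')
        # every earlier deletion shifts the remaining elements left by one,
        # so position i in data_array is position i - k in new_array
        new_array = np.delete(new_array, i - k)
        k += 1
new_array is now data_array without 160, as you wished.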
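As a side note, the i - k bookkeeping can be avoided, because np.delete also accepts an array of indices. Here is a minimal sketch, using a short made-up array and hand-picked bounds rather than the real data and its computed bounds:

```python
import numpy as np

# Toy data and hand-picked bounds, made up for this illustration
values = np.array([99.5, 55.0, 160.0, 101.3, 40.0])
lower_anomaly_point = 50.0
upper_anomaly_point = 150.0

# Positions of every point outside the bounds, relative to the original array
anomaly_indices = np.flatnonzero(
    (values > upper_anomaly_point) | (values < lower_anomaly_point)
)

# Deleting them in a single call means no position has shifted yet
new_array = np.delete(values, anomaly_indices)

print(anomaly_indices.tolist())  # [2, 4]
print(new_array.tolist())        # [99.5, 55.0, 101.3]
```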