How to extract 95% of the data in Python

Given an array of numbers, I would like to drop outliers while preserving 95% of the data points. For example, range(0,100,1) would become range(2,98,1).

Or, if the data looks something like this:

[0.01,0.02,4,5,7,3,1,4,6,7,10000,10002] -> [4,5,7,3,1,4,6,7]

Is there a function in the Python standard library or NumPy for this purpose?

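It sounds like you want to filter out points that lie far from the median, measured in units of the median absolute deviation (MAD).

The MAD of this dataset is 2.5, whereas the standard deviation is over 3000 because the two extreme values dominate it. We can use the MAD to drop points that are more than 2 median deviations away from the median. Note that for normally distributed data a cutoff of 2 MADs keeps only about 82% of the points; roughly 2.9 MADs corresponds to the central 95%, so adjust the threshold if you need that coverage.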

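import numpy as np

data = np.array([0.01, 0.02, 4, 5, 7, 3, 1, 4, 6, 7, 10000, 10002])
deviations = 2  # cutoff, in multiples of the MAD

d = np.abs(data - np.median(data))  # absolute distance of each point from the median
med_abs_dev = np.median(d)          # median absolute deviation (2.5 for this data)
s = d / med_abs_dev                 # distances expressed in MADs
filtered = data[s < deviations]     # keep points closer than the cutoff
# [ 0.01  0.02  4.    5.    7.    3.    1.    4.    6.    7.  ]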
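
If the goal is literally to keep the central 95% of the values rather than to detect outliers, a percentile cut is another option. Below is a minimal sketch, assuming NumPy is available; `keep_central` and its `keep` parameter are hypothetical names chosen for illustration, not part of NumPy.

```python
import numpy as np

def keep_central(data, keep=95.0):
    """Return the values inside the central `keep` percent of `data`.

    Hypothetical helper for illustration; not a NumPy function.
    """
    data = np.asarray(data, dtype=float)
    tail = (100.0 - keep) / 2.0  # percent trimmed from each end
    lo, hi = np.percentile(data, [tail, 100.0 - tail])
    # Boolean mask keeps the original ordering of the surviving values.
    return data[(data >= lo) & (data <= hi)]

data = [0.01, 0.02, 4, 5, 7, 3, 1, 4, 6, 7, 10000, 10002]
trimmed = keep_central(data, keep=95.0)
```

A percentile cut removes a fixed fraction from each end no matter how extreme the values are. On the 12-point sample above it drops only 0.01 and 10002 and still keeps 10000, so the MAD filter is probably the better fit when the aim is to remove genuine outliers.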
