
Outliers formula for columns in pandas

I have a numerical column in a DataFrame df that has 10 other columns (both numerical and categorical). For example:

Units
-12
4
4
5
1
5
12
6
34
6
7
12
745

I would like to apply the formulas:

Low outlier: q1-(1.5*iqr)
High outlier: q3+(1.5*iqr)
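
As a worked instance of these two formulas on the sample column above, here is a minimal sketch. It assumes NumPy's default (linear interpolation) percentile definition, and the names units, iqr_value, low_fence and high_fence are only illustrative:

```python
import numpy as np

# The sample Units column from the question (no missing values here)
units = np.array([-12, 4, 4, 5, 1, 5, 12, 6, 34, 6, 7, 12, 745], dtype=float)

# First and third quartiles; 4.0 and 12.0 for this sample
q1, q3 = np.percentile(units, [25, 75])

# Interquartile range is the distance between the quartiles
iqr_value = q3 - q1

# Fences from the formulas: q1-(1.5*iqr) and q3+(1.5*iqr)
low_fence = q1 - 1.5 * iqr_value
high_fence = q3 + 1.5 * iqr_value

# Values strictly outside the fences are the outliers
outliers = units[(units < low_fence) | (units > high_fence)]
print(low_fence, high_fence, outliers)
```

For this sample the fences come out as -8 and 24, so -12, 34 and 745 fall outside them.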

I know that scipy provides iqr (from scipy.stats import iqr), and that with numpy I can calculate q1 (the first quartile) and q3 (the third quartile) as follows:

from scipy.stats import iqr
import numpy as np

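q1 = np.percentile(df.Units, 25)
q3 = np.percentile(df.Units, 75)

# Selects the rows whose value lies in [q1, q3]; this is not the IQR value (q3 - q1).
# The result is named middle so that it does not shadow the imported iqr function.
mask = df['Units'].between(q1, q3, inclusive="both")
middle = df.loc[mask, 'Units']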

However there are two missing steps in calculating the formulas above:

  • handle NaN values: I would prefer not to remove them from my column, only to exclude them from the calculations;
  • apply the formulas correctly:

Low outlier: q1-(1.5*iqr)
High outlier: q3+(1.5*iqr)

I think this could be done with between, or simply by filtering out the values lower or higher than the bounds calculated from the formulas above.
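
The between idea can be sketched with pandas alone. This is only one possible approach, assuming that Series.quantile (which skips NaN by default) is an acceptable quartile definition; the names iqr_value, low_fence, high_fence, is_outlier and filtered are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Units": [-12, 4, 4, 5, 1, 5, 12, 6, 34, 6, 7, np.nan, 12, 745]}
)

# Series.quantile ignores NaN, so the missing value does not affect the quartiles
q1 = df["Units"].quantile(0.25)
q3 = df["Units"].quantile(0.75)
iqr_value = q3 - q1

low_fence = q1 - 1.5 * iqr_value
high_fence = q3 + 1.5 * iqr_value

# between() is inclusive on both ends by default and returns False for NaN
in_range = df["Units"].between(low_fence, high_fence)

# A row is an outlier only if it has a value and that value is out of range,
# so NaN rows are never flagged and stay in the column
is_outlier = df["Units"].notna() & ~in_range

filtered = df[~is_outlier]
print(filtered)
```

Keeping is_outlier as a separate boolean Series also makes it possible to flag the rows instead of dropping them.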

A lazy attempt with df.query, which excludes NaNs from the calculations and keeps them in the final result:

import numpy as np
import pandas as pd
from scipy.stats import iqr

df = pd.DataFrame([-12, 4, 4, 5, 1, 5, 12, 6, 34, 6, 7, np.nan, 12, 745], columns=["Units"])

q1 = np.nanpercentile(df.Units, 25)
q3 = np.nanpercentile(df.Units, 75)
IQR = iqr(df.Units, nan_policy="omit")

df.query("Units.isnull() or ((@q1 - 1.5*@IQR) <= Units <= (@q3 + 1.5*@IQR))")

which gives the (hopefully) expected result:

    Units
1     4.0
2     4.0
3     5.0
4     1.0
5     5.0
6    12.0
7     6.0
9     6.0
10    7.0
11    NaN
12   12.0
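
Since the real DataFrame has several other numerical and categorical columns, the same rule could be applied to every numeric column at once. The following is a rough sketch under that assumption: iqr_outlier_mask is a hypothetical helper (not a pandas function), and the Kind column is made up to stand in for a categorical column:

```python
import numpy as np
import pandas as pd


def iqr_outlier_mask(frame, k=1.5):
    """Return a boolean DataFrame marking IQR outliers in each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)  # per-column quartiles, NaN skipped
    q3 = numeric.quantile(0.75)
    iqr_value = q3 - q1
    low = q1 - k * iqr_value
    high = q3 + k * iqr_value
    # Comparisons with NaN are False, so missing values are never flagged
    return numeric.lt(low, axis="columns") | numeric.gt(high, axis="columns")


df = pd.DataFrame(
    {
        "Units": [-12, 4, 4, 5, 1, 5, 12, 6, 34, 6, 7, np.nan, 12, 745],
        "Kind": list("ababababababab"),  # illustrative categorical column
    }
)

mask = iqr_outlier_mask(df)

# Drop rows that are outliers in at least one numeric column; NaN rows are kept
cleaned = df[~mask.any(axis=1)]
print(cleaned)
```

On the sample data this keeps the same rows as the df.query result above; non-numeric columns are ignored by the helper and carried through unchanged.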
