Outliers formula for columns in pandas

Question

I have a numerical column, Units, in a DataFrame df that has 10 other columns (both numerical and categorical).

I would like to apply the formulas:

Low outlier: q1-(1.5*iqr)
High outlier: q3+(1.5*iqr)

I know that scipy provides iqr (from scipy.stats import iqr), and that with numpy I can calculate q1 (first quartile) and q3 (third quartile) as follows:

from scipy.stats import iqr
import numpy as np

q1=np.percentile(df.Units, 25)
q3=np.percentile(df.Units, 75)

# between() includes both bounds by default; a boolean inclusive=True is rejected by newer pandas
mask = df['Units'].between(q1, q3)
iqr = df.loc[mask, 'Units']

However, two steps are still missing:

handle NaN values: I would prefer not to remove them from my column, only to exclude them from the calculations;
correctly apply the formulas:

Low outlier: q1-(1.5*iqr)
High outlier: q3+(1.5*iqr)

I think this could be done either with between or by simply filtering out the values that are lower/higher than the bounds calculated from the formulas above.
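The idea can be sketched roughly as follows. This is only a minimal illustration under some assumptions: the sample values are made up for the example, the names iqr_value, low, high, within and keep are hypothetical, and the IQR is computed as q3 - q1 with pandas' Series.quantile (which skips NaN) instead of scipy's iqr.

```python
import numpy as np
import pandas as pd

# Illustrative sample data (an assumption, not the real column)
df = pd.DataFrame({"Units": [-12, 4, 4, 5, 1, 5, 12, 6, 34, 6, 7, np.nan, 12, 745]})

# Series.quantile ignores NaN, so missing values do not affect the quartiles
q1 = df["Units"].quantile(0.25)
q3 = df["Units"].quantile(0.75)
iqr_value = q3 - q1

low = q1 - 1.5 * iqr_value   # low outlier bound
high = q3 + 1.5 * iqr_value  # high outlier bound

# between() is False for NaN, so NaN rows are added back explicitly
within = df["Units"].between(low, high)
keep = within | df["Units"].isna()

filtered = df[keep]    # rows inside the bounds, plus the NaN rows
outliers = df[~keep]   # rows below low or above high
```

Here NaN rows stay in filtered because of the explicit isna() term; dropping that term would remove them together with the outliers.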

Answer 1

A lazy attempt with df.query that excludes NaNs from the calculations and keeps them in the final result:

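import numpy as np
import pandas as pd
from scipy.stats import iqr

df = pd.DataFrame([-12,4,4,5,1,5,12,6,34,6,7,np.nan, 12,745], columns=["Units"])

# Quartiles and IQR computed with NaNs ignored
q1 = np.nanpercentile(df.Units, 25)
q3 = np.nanpercentile(df.Units, 75)
IQR = iqr(df.Units, nan_policy="omit")

# Keep the NaN rows, plus the rows inside [q1 - 1.5*IQR, q3 + 1.5*IQR]
df.query("Units.isnull() or ((@q1 - 1.5*@IQR) <= Units <= (@q3 + 1.5*@IQR))")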

which keeps every row whose Units value lies within the two bounds, plus the NaN rows (hopefully the expected result).
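If the goal is to mark outliers rather than filter them out, a flag column may be more convenient. The sketch below is one possible variant, not part of the answer above: iqr_fences and is_outlier are hypothetical names, the multiplier k=1.5 is taken from the formulas in the question, and the IQR is computed as q3 - q1 from np.nanpercentile instead of calling scipy.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Units": [-12, 4, 4, 5, 1, 5, 12, 6, 34, 6, 7, np.nan, 12, 745]})

def iqr_fences(values, k=1.5):
    """Return (low, high) outlier bounds, ignoring NaN (hypothetical helper)."""
    q1, q3 = np.nanpercentile(values, [25, 75])
    spread = q3 - q1  # interquartile range
    return q1 - k * spread, q3 + k * spread

low, high = iqr_fences(df["Units"])

# Comparisons against NaN evaluate to False, so NaN rows are never flagged
df["is_outlier"] = (df["Units"] < low) | (df["Units"] > high)
```

All rows are preserved this way, and df[~df["is_outlier"]] selects the same rows as the query above.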
