Identifying and removing outliers based on more than one condition in a dataset using Python

Question

I am preparing a dataset for regression modelling and would like to remove all outliers first. The dataset has 7 continuous variables. Five of them can be handled across the whole dataset. The other two, height and weight, clearly differ between males and females, so I need to split the data by sex, detect and remove the outliers in height and weight within each group, and then combine the result with the data I have already prepared. Is there a simple way of doing this?

So far I have used the interquartile range (IQR) on the five variables that do not need to be split by sex, with this code for each variable:

Q1 = df["Variable"].quantile(0.25)
Q3 = df["Variable"].quantile(0.75)

IQR = Q3-Q1
Lower_Fence = Q1 - (1.5*IQR)
Upper_Fence = Q3 + (1.5*IQR)

print(Lower_Fence)
print(Upper_Fence)

df[((df["Variable"] < Lower_Fence) | (df["Variable"] > Upper_Fence))]  # Detection of outliers
df[~((df["Variable"] < Lower_Fence) | (df["Variable"] > Upper_Fence))]  # Removal of outliers

I am relatively new to Python.

Answer 1

You can define a function for your "outlier" logic, then apply that repeatedly for all columns, with or without groupby:

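import numpy as np
import pandas as pd

def is_outlier(s, quantiles=[.25, .75], thresholds=[-.5, .5]):
    # Returns a boolean Series: True where a value lies outside the fences.
    # Change the thresholds to [-1.5, 1.5] for the 1.5*IQR rule used in your question.
    a, b = s.quantile(quantiles)  # lower and upper quantiles (Q1, Q3 by default)
    iqr = b - a
    lo, hi = np.array(thresholds) * iqr + [a, b]  # lower and upper fences
    return (s < lo) | (s > hi)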

Simple test:

n = 20
np.random.seed(0)
df = pd.DataFrame(dict(
    status=np.random.choice(['dead', 'alive'], n),
    gender=np.random.choice(['M', 'F'], n),
    weight=np.random.normal(150, 40, n),
    diastolic=np.random.normal(80, 10, n),
    cholesterol=np.random.normal(200, 20, n),
))

Example usage:

mask = is_outlier(df['diastolic'])  # overall outliers
# or
# transform keeps the original row index (apply adds a group level in pandas >= 2.0)
mask = df.groupby('gender')['weight'].transform(is_outlier)  # per gender group

Usage to filter out data:

mask = False

# overall outliers
for k in ['diastolic', 'cholesterol']:  # etc
    mask |= is_outlier(df[k])

# per-gender outliers
gb = df.groupby('gender')
for k in ['weight']:  # and any other columns needed for per-gender
    mask |= gb[k].transform(is_outlier)  # transform keeps the original row index

# finally, select the non-outliers
df_filtered = df.loc[~mask]
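
To make the steps above concrete, here is a minimal, self-contained sketch using the 1.5*IQR rule from the question. The helper names `iqr_outliers` and `drop_outliers`, the column names, and the small hand-made dataset are all assumptions made for illustration; they are not taken from your data.

```python
import pandas as pd


def iqr_outliers(s, k=1.5):
    """Boolean Series: True where a value lies outside the k*IQR fences."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


def drop_outliers(df, overall_cols, grouped_cols, group_col, k=1.5):
    """Drop rows that are outliers in any listed column (hypothetical helper)."""
    mask = pd.Series(False, index=df.index)
    # Columns judged against the whole dataset
    for col in overall_cols:
        mask |= iqr_outliers(df[col], k)
    # Columns judged within each group (e.g. separately for males and females)
    grouped = df.groupby(group_col)
    for col in grouped_cols:
        flags = grouped[col].transform(lambda s: iqr_outliers(s, k))
        mask |= flags.astype(bool)  # astype guards against non-bool results
    return df.loc[~mask]


# Illustrative data only: 6 males followed by 6 females
df = pd.DataFrame({
    "gender": ["M"] * 6 + ["F"] * 6,
    "height": [170, 172, 174, 176, 178, 150, 155, 157, 159, 161, 163, 185],
    "cholesterol": [400, 195, 200, 205, 210, 198, 202, 196, 204, 199, 201, 190],
})

clean = drop_outliers(df, overall_cols=["cholesterol"],
                      grouped_cols=["height"], group_col="gender")
removed = df.index.difference(clean.index).tolist()
print(removed)  # [0, 5, 11]
```

Row 0 is dropped for its cholesterol value, while rows 5 and 11 are dropped only because their heights are unusual within their own gender group. Because everything is combined into a single mask on the original index, there is no need to split the data and merge it back afterwards.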

BTW, note that the per-gender outliers differ from the overall ones, e.g. for 'weight':

df.groupby('gender')['weight'].transform(is_outlier) == is_outlier(df['weight'])
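
One way to see why the two results differ is to look at the fences themselves. The sketch below computes the 1.5*IQR fences once for the whole column and once per gender; the data and the 'height' column are made up for illustration.

```python
import pandas as pd

# Illustrative data only: 6 males followed by 6 females
df = pd.DataFrame({
    "gender": ["M"] * 6 + ["F"] * 6,
    "height": [170, 172, 174, 176, 178, 150, 155, 157, 159, 161, 163, 185],
})

# Fences over the whole column
q1, q3 = df["height"].quantile(0.25), df["height"].quantile(0.75)
overall_lower = q1 - 1.5 * (q3 - q1)
overall_upper = q3 + 1.5 * (q3 - q1)

# Fences per gender: one row per group
g = df.groupby("gender")["height"]
gq1, gq3 = g.quantile(0.25), g.quantile(0.75)
giqr = gq3 - gq1
per_gender = pd.DataFrame({"lower": gq1 - 1.5 * giqr,
                           "upper": gq3 + 1.5 * giqr})

# Broadcast each row's group fences back onto the rows and flag outliers
lower = df["gender"].map(per_gender["lower"])
upper = df["gender"].map(per_gender["upper"])
grouped_mask = (df["height"] < lower) | (df["height"] > upper)
overall_mask = (df["height"] < overall_lower) | (df["height"] > overall_upper)

print(overall_lower, overall_upper)
print(per_gender)
print(df[grouped_mask != overall_mask])  # rows where the two rules disagree
```

With this data the overall fences are wide enough that nothing is flagged, whereas the narrower per-gender fences flag the 150 among the males and the 185 among the females.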
