Identifying and removing outliers based on more than one condition in a dataset using Python

I am preparing a dataset for regression modelling and would like to remove all outliers first. The dataset has 7 continuous variables. Five of them can be handled across the whole dataset. The other two, height and weight, first need to be split between male and female participants: these measurements clearly differ between males and females, so to find the outliers I need to separate the data by sex, assess and remove the outliers in height and weight for each group, and then combine the result with the data I have already prepared. Is there a simple way of doing this? So far I have been using the interquartile range on the other 5 variables, which do not need to be split by sex, with this code for each variable:

Q1 = df["Variable"].quantile(0.25)
Q3 = df["Variable"].quantile(0.75)

IQR = Q3-Q1
Lower_Fence = Q1 - (1.5*IQR)
Upper_Fence = Q3 + (1.5*IQR)

print(Lower_Fence)
print(Upper_Fence)

df[((df["Variable"] < Lower_Fence) | (df["Variable"] > Upper_Fence))]  # Detection of outliers
df[~((df["Variable"] < Lower_Fence) | (df["Variable"] > Upper_Fence))]  # Removal of outliers

I am relatively new to Python.

[Image: a picture of the data I am working with]

You can define a function for your "outlier" logic, then apply it repeatedly to all columns, with or without groupby:

import numpy as np
import pandas as pd

def is_outlier(s, quantiles=(.25, .75), thresholds=(-.5, .5)):
    # change the thresholds to (-1.5, 1.5) to reflect IQR as per your question
    a, b = s.quantile(list(quantiles))  # lower and upper quantiles
    iqr = b - a
    lo, hi = np.array(thresholds) * iqr + [a, b]  # lower and upper fences
    return (s < lo) | (s > hi)
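To make the effect of the thresholds concrete, here is a small worked sketch on a made-up series of heights (the values and the name `heights` are purely illustrative, and the function is repeated so that the snippet runs on its own). With quartiles of 162.5 and 177.5 the IQR is 15, so the default ±0.5 multipliers give fences of 155 and 185, while ±1.5 gives 140 and 200.

```python
import numpy as np
import pandas as pd

def is_outlier(s, quantiles=(.25, .75), thresholds=(-.5, .5)):
    a, b = s.quantile(list(quantiles))
    iqr = b - a
    lo, hi = np.array(thresholds) * iqr + [a, b]
    return (s < lo) | (s > hi)

# Hypothetical data, chosen so the quartiles are easy to check by hand
heights = pd.Series([150, 160, 165, 170, 175, 180, 250])

narrow = is_outlier(heights)                           # fences at 155 and 185
classic = is_outlier(heights, thresholds=(-1.5, 1.5))  # fences at 140 and 200

print(heights[narrow].tolist())   # [150, 250]
print(heights[classic].tolist())  # [250]
```

The narrower default flags more points, which is convenient for a small demo; for the 1.5 × IQR rule from the question, pass the wider thresholds.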

Simple test:

n = 20
np.random.seed(0)
df = pd.DataFrame(dict(
    status=np.random.choice(['dead', 'alive'], n),
    gender=np.random.choice(['M', 'F'], n),
    weight=np.random.normal(150, 40, n),
    diastolic=np.random.normal(80, 10, n),
    cholesterol=np.random.normal(200, 20, n),
))

Example usage:

mask = is_outlier(df['diastolic'])  # overall outliers
# or, per gender group (transform returns a result aligned with df's index)
mask = df.groupby('gender')['weight'].transform(is_outlier)

Usage to filter out data:

mask = False

# overall outliers
for k in ['diastolic', 'cholesterol']:  # etc
    mask |= is_outlier(df[k])

# per-gender outliers
gb = df.groupby('gender')
for k in ['weight']:  # and any other columns needed for per-gender
    mask |= gb[k].transform(is_outlier)

# finally, select the non-outliers
df_filtered = df.loc[~mask]
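The filtering steps above could be wrapped in a single helper, sketched below. The name `drop_outliers`, its parameters, and the sample data are assumptions made for illustration; they are not part of the original answer. The sketch uses the 1.5 × IQR thresholds from the question and `transform` for the per-group columns so that the result lines up with the frame's index.

```python
import numpy as np
import pandas as pd

def is_outlier(s, quantiles=(.25, .75), thresholds=(-1.5, 1.5)):
    a, b = s.quantile(list(quantiles))
    iqr = b - a
    lo, hi = np.array(thresholds) * iqr + [a, b]
    return (s < lo) | (s > hi)

def drop_outliers(df, overall_cols, grouped_cols, by):
    """Hypothetical helper: drop rows flagged in any listed column."""
    mask = pd.Series(False, index=df.index)
    for col in overall_cols:            # fences computed on the whole column
        mask |= is_outlier(df[col])
    gb = df.groupby(by)
    for col in grouped_cols:            # fences computed within each group
        mask |= gb[col].transform(is_outlier).astype(bool)
    return df.loc[~mask]

# Made-up data: one extreme cholesterol value, and one weight per gender
# that is unusual within its group but not across the whole column
df = pd.DataFrame({
    'gender': ['F'] * 6 + ['M'] * 6,
    'weight': [55, 58, 60, 62, 65, 95, 75, 78, 80, 82, 85, 40],
    'cholesterol': [400, 195, 200, 205, 210, 198, 202, 207, 193, 204, 199, 190],
})

clean = drop_outliers(df, overall_cols=['cholesterol'],
                      grouped_cols=['weight'], by='gender')
print(clean.index.tolist())  # [1, 2, 3, 4, 6, 7, 8, 9, 10]
```

Rows 5 and 11 are removed only because of the per-gender check: neither weight falls outside the fences computed on the full column. For the dataset in the question, `grouped_cols` would presumably list both height and weight, and `overall_cols` the remaining five variables.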

BTW, note how per-gender outliers differ from the overall ones, e.g. for 'weight':

df.groupby('gender')['weight'].transform(is_outlier) == is_outlier(df['weight'])
