
How to replace the outliers with the 95th and 5th percentile in Python?

I am trying to do outlier treatment on my time series data: I want to replace values above the 95th percentile with the 95th percentile value, and values below the 5th percentile with the 5th percentile value. I have prepared some code, but it does not produce the desired result.

I am trying to create an OutlierTreatment function using a helper function called Cut. The code is given below:

def outliertreatment(df,high_limit,low_limit):
    df_temp=df['y'].apply(cut,high_limit,low_limit, extra_kw=1)
    return df_temp
def cut(column,high_limit,low_limit):
    conds = [column > np.percentile(column, high_limit),
             column < np.percentile(column, low_limit)]
    choices = [np.percentile(column, high_limit),
            np.percentile(column, low_limit)]
    return np.select(conds,choices,column)  

I expect to pass the dataframe, 95 as high_limit and 5 as low_limit to the OutlierTreatment function. How can I achieve the desired result?

I'm not sure whether this approach is a suitable way to deal with outliers, but to achieve what you want, the clip function is useful. It sets values outside the boundaries to the boundary values. You can read more in the documentation.

import numpy as np
import pandas as pd
data = pd.Series(np.random.randn(100))
data.clip(lower=data.quantile(0.05), upper=data.quantile(0.95))
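
To match the interface asked for in the question (a dataframe plus 95 and 5 as limits), the clip call can be wrapped in a function. This is only a minimal sketch: the name outlier_treatment, the column parameter and its default of 'y' are assumptions based on the question's code, not part of the original answer.

```python
import numpy as np
import pandas as pd


def outlier_treatment(df, high_limit=95, low_limit=5, column="y"):
    """Clip df[column] to its low_limit-th and high_limit-th percentiles.

    Hypothetical helper: limits are given as percentages (e.g. 95 and 5).
    """
    series = df[column]
    # Series.quantile expects fractions in [0, 1], so convert the percentages.
    lower = series.quantile(low_limit / 100)
    upper = series.quantile(high_limit / 100)
    # Values outside [lower, upper] are replaced by the nearest bound.
    return series.clip(lower=lower, upper=upper)


# Example data (made up for illustration)
rng = np.random.default_rng(0)
df = pd.DataFrame({"y": rng.normal(size=100)})
treated = outlier_treatment(df, high_limit=95, low_limit=5)
```

The function returns a new Series and leaves df unchanged; assign the result back (for example df['y'] = treated) if the dataframe itself should be updated.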

If your data contains multiple columns

For an individual column:

p_05 = df['sales'].quantile(0.05) # 5th percentile
p_95 = df['sales'].quantile(0.95) # 95th percentile

# assign the result back; inplace=True on a selected column may not update df
df['sales'] = df['sales'].clip(lower=p_05, upper=p_95)

For more than one numerical column:

num_col = df.select_dtypes(include=['int64','float64']).columns.tolist()

# or you can create a custom list of numerical columns

df[num_col] = df[num_col].apply(lambda x: x.clip(*x.quantile([0.05, 0.95])))
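
As an alternative to the per-column apply above, the bounds for all columns can be computed once and passed to DataFrame.clip with axis=1, which aligns each bound with its column. This is a sketch on made-up data; the column names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical example data with one obvious outlier at each end
df = pd.DataFrame({
    "sales": [1, 10, 11, 12, 13, 14, 15, 16, 17, 100],
    "units": [0.5, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 9.0],
})
num_col = ["sales", "units"]

# One row per quantile, one column per numerical column
bounds = df[num_col].quantile([0.05, 0.95])

# axis=1 matches each lower/upper bound to the column of the same name
clipped = df[num_col].clip(lower=bounds.loc[0.05], upper=bounds.loc[0.95], axis=1)

# Same result as the per-column apply approach
expected = df[num_col].apply(lambda x: x.clip(*x.quantile([0.05, 0.95])))
```

Both approaches should give the same values; the DataFrame-level version just avoids the lambda.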

Bonus:

To check for outliers using a box plot:

import matplotlib.pyplot as plt

for x in num_col:
    plt.figure()            # start a new figure for each column
    df.boxplot(column=x)    # draw the box plot for this column only
plt.show()
