Create outliers column in pandas groupby DataFrame

I have a very large pandas DataFrame with several thousand codes and the cost associated with each one of them (sample):

import pandas as pd

data = {'code': ['a', 'b', 'a', 'c', 'c', 'c', 'c'],
        'cost': [10, 20, 100, 10, 10, 500, 10]}
df = pd.DataFrame(data)

I am creating a groupby object at the code level, i.e.:

grouped = df.groupby('code')['cost'].agg(['sum', 'mean']).apply(pd.Series)

Now I need to add a new column to this grouped DataFrame giving, for each code, the percentage of observations that have outlier costs. My initial approach was this external function (using iqr from scipy):

import numpy as np
from scipy.stats import iqr

def is_outlier(s):
    # Only calculate outliers when we have at least 100 observations
    if s.count() >= 100:
        return np.where(s >= s.quantile(0.75) + 1.5 * iqr(s), 1, 0).mean()
    else:
        return np.nan

Having written this function, I added is_outlier to my agg arguments in the groupby above. This did not work, because I am trying to evaluate this is_outlier rate for every element in the cost series:

grouped = df.groupby('code')['cost'].agg(['sum', 'mean', is_outlier]).apply(pd.Series)

I attempted to use pd.Series.where, but it does not have the same functionality as np.where. Is there a way to modify my is_outlier function, which has to take the cost series as its argument, so that it correctly evaluates the outlier rate for each code? Or am I completely off track?
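For readers comparing the two calls, here is a small sketch of how they differ. The cutoff of 100 is an arbitrary value chosen for illustration, not the outlier rule from the question. Series.where keeps the original values where the condition holds and substitutes the replacement elsewhere, while np.where picks between its two value arguments, which is what produces 0/1 flags.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 10, 500, 10])
mask = s >= 100  # arbitrary cutoff, for illustration only

# Series.where: keep s where mask is True, otherwise use 0
kept = s.where(mask, 0)

# np.where: choose 1 where mask is True, otherwise 0
flags = np.where(mask, 1, 0)

# A pandas-only way to get the same 0/1 flags
flags_pd = mask.astype(int)

print(kept.tolist())
print(flags.tolist())
print(flags_pd.tolist())
```

Taking the mean of either set of flags gives the share of values that met the condition.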

UPDATE: Desired result (ignoring the minimum-observations requirement for this example):

>>> grouped
  code   sum   mean    is_outlier
0  'a'   110   55      0.5
1  'b'   20    20      0
2  'c'   530   132.5   0.25

Note: my sample is a poor one for calculating outliers, since I have only 2, 1, and 4 observations for the respective codes. In the production DataFrame each code has hundreds or thousands of observations, each with an associated cost. In the sample result above, the values of is_outlier mean that for 'a' one of the two observations has a cost in the outlier range, for 'c' one of the four observations has a cost in the outlier range, and so on. I am trying to recreate this in my function by assigning 1s and 0s as the result of np.where() and taking the .mean() of that.
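The "flag with 1s and 0s, then take the mean" idea can be sketched as a per-group aggregation function. This is only a sketch under stated assumptions: the name outlier_rate and the min_obs parameter are hypothetical, the IQR is computed from pandas quantiles instead of scipy's iqr, and the comparison is a strict > (the question uses >=) so that a group whose values are all equal is not flagged as all outliers.

```python
import numpy as np
import pandas as pd

def outlier_rate(s, min_obs=100):
    """Share of values in s above Q3 + 1.5 * IQR (hypothetical helper)."""
    # Too few observations: report NaN, as in the question
    if s.count() < min_obs:
        return np.nan
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    upper_fence = q3 + 1.5 * (q3 - q1)
    # The mean of a boolean Series is the proportion of True values
    return (s > upper_fence).mean()

data = {'code': ['a', 'b', 'a', 'c', 'c', 'c', 'c'],
        'cost': [10, 20, 100, 10, 10, 500, 10]}
df = pd.DataFrame(data)

# min_obs=1 only so that the tiny sample produces numbers
grouped = df.groupby('code')['cost'].agg(
    sum='sum',
    mean='mean',
    is_outlier=lambda s: outlier_rate(s, min_obs=1),
)
print(grouped)
```

Named aggregation (keyword=function) is used here so the extra min_obs argument can be passed through a lambda while still getting a readable column name.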

.apply(pd.Series) is needed in order to cast the <pandas.core.groupby.SeriesGroupBy object> resulting from groupby into a DataFrame. s is a pandas Series with all values of cost for each code, as generated from the groupby operation (the split phase of split-apply-combine).
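As a quick check of that assumption, the sketch below compares the aggregation with and without the trailing .apply(pd.Series) on the sample data. The variable names plain and wrapped are illustrative. When agg is given a list of functions, the result already appears to be a DataFrame, so the extra call looks optional.

```python
import pandas as pd

data = {'code': ['a', 'b', 'a', 'c', 'c', 'c', 'c'],
        'cost': [10, 20, 100, 10, 10, 500, 10]}
df = pd.DataFrame(data)

# agg with a list of functions on a SeriesGroupBy
plain = df.groupby('code')['cost'].agg(['sum', 'mean'])

# The same aggregation followed by the extra call from the question
wrapped = plain.apply(pd.Series)

print(type(plain).__name__)
print(plain)
print(wrapped)
```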

Data used

# Loading libraries
import pandas as pd
import numpy as np

# Creating the data set
data = {'code': ['a', 'b', 'a', 'c', 'c', 'c', 'c', 'a', 'a', 'a'],
        'cost': [10, 20, 200, 10, 10, 500, 10, 10, 10, 10]}

df = pd.DataFrame(data)

Defining a function for calculating the proportion of outliers in a specified column

def outlier_prop(df, name, group_by):
    """
    Packages required:
        import pandas as pd
        import numpy as np

    Input:
        df       = original DataFrame
        name     = name of the column to check for outliers
        group_by = name of the column to group by

    Output:
        DataFrame grouped by group_by, holding the sum and mean of name and
        of the 0/1 column 'outlier'; the mean of 'outlier' is the proportion
        of outliers in each group
    """

    # Step 1: Create a dict of values for each group
    # (zip pairs the two columns by position, so any index works)
    value_dict = dict()
    for group, value in zip(df[group_by], df[name]):
        if group not in value_dict:
            value_dict[group] = [value]
        else:
            value_dict[group].append(value)

    # Step 2: Calculate the outlier threshold (mean + 1.5 * std) for each group
    outlier_thres_dict = dict()
    for group, values in value_dict.items():
        outlier_thres_dict[group] = np.mean(values) + 1.5 * np.std(values)

    # Step 3: Create a list flagging values greater than the group-specific threshold
    dummy_list = []
    for group, value in zip(df[group_by], df[name]):
        if value > outlier_thres_dict[group]:
            dummy_list.append(1)
        else:
            dummy_list.append(0)

    # Step 4: Add the list to a copy of the dataframe (the caller's df is left unchanged)
    df = df.copy()
    df['outlier'] = dummy_list

    # Step 5: Group and aggregate; the mean of 'outlier' is the proportion of outliers
    grouped = df.groupby(group_by).agg(['sum', 'mean'])

    # Step 6: Return data frame
    return grouped

Calling the function

outlier_prop(df, 'cost', 'code')

Output

https://raw.githubusercontent.com/magoavi/stackoverflow/master/50533570.png
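The same mean + 1.5 * std rule can probably be written without explicit loops. The following is a sketch only, not the answer's original code: the column names cost_sum, cost_mean and outlier_rate are made up for illustration, and ddof=0 is passed to pandas' std so that it matches np.std, which uses the population standard deviation by default.

```python
import pandas as pd

data = {'code': ['a', 'b', 'a', 'c', 'c', 'c', 'c', 'a', 'a', 'a'],
        'cost': [10, 20, 200, 10, 10, 500, 10, 10, 10, 10]}
df = pd.DataFrame(data)

costs = df.groupby('code')['cost']

# Per-row threshold: group mean + 1.5 * group population std (ddof=0, like np.std)
threshold = costs.transform('mean') + 1.5 * costs.transform(lambda s: s.std(ddof=0))

# Flag rows above their group's threshold, then aggregate per code
result = (
    df.assign(outlier=(df['cost'] > threshold).astype(int))
      .groupby('code')
      .agg(cost_sum=('cost', 'sum'),
           cost_mean=('cost', 'mean'),
           outlier_rate=('outlier', 'mean'))
)
print(result)
```

On this ten-row sample the flagged rows are the 200 in 'a' and the 500 in 'c', so outlier_rate comes out as 0.2, 0.0 and 0.25 for 'a', 'b' and 'c'.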
