Python Pandas - Faster way to iterate through categories in data while removing outliers (without a for loop)

Question

Suppose I have a DataFrame like this:

import pandas as pd
import numpy as np

data = [[5123, '2021-01-01 00:00:00', 'cash','sales$', 105],
        [5123, '2021-01-01 00:00:00', 'cash','items', 20],
        [5123, '2021-01-01 00:00:00', 'card','sales$', 190],
        [5123, '2021-01-01 00:00:00', 'card','items', 40],
        [5123, '2021-01-02 00:00:00', 'cash','sales$', 75],
        [5123, '2021-01-02 00:00:00', 'cash','items', 10],
        [5123, '2021-01-02 00:00:00', 'card','sales$', 170],
        [5123, '2021-01-02 00:00:00', 'card','items', 35],
        [5123, '2021-01-03 00:00:00', 'cash','sales$', 1000],
        [5123, '2021-01-03 00:00:00', 'cash','items', 500],
        [5123, '2021-01-03 00:00:00', 'card','sales$', 150],
        [5123, '2021-01-03 00:00:00', 'card','items', 20]]

columns = ['Store', 'Date', 'Payment Method', 'Attribute', 'Value']

df = pd.DataFrame(data = data, columns = columns)

Store	Date	Payment Method	Attribute	Value
5123	2021-01-01 00:00:00	cash	sales$	105
5123	2021-01-01 00:00:00	cash	items	20
5123	2021-01-01 00:00:00	card	sales$	190
5123	2021-01-01 00:00:00	card	items	40
5123	2021-01-02 00:00:00	cash	sales$	75
5123	2021-01-02 00:00:00	cash	items	10
5123	2021-01-02 00:00:00	card	sales$	170
5123	2021-01-02 00:00:00	card	items	35
5123	2021-01-03 00:00:00	cash	sales$	1000
5123	2021-01-03 00:00:00	cash	items	500
5123	2021-01-03 00:00:00	card	sales$	150
5123	2021-01-03 00:00:00	card	items	20

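I want to filter outliers and replace them with the average of the previous two days. My "outlier rule" is this: if the value for an attribute/payment method is more than twice, or less than half, the average of that attribute/payment method over the previous two days, replace the outlier with that two-day average. Otherwise, keep the value. In this case, every value should be kept except the $1000 in sales and the 500 items for 5123/'2021-01-03'/'cash'. Those should be replaced with $90 in sales and 15 items.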
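As a quick sanity check of the rule on the day-3 values from the table, here is a minimal sketch in plain Python. The helper name `apply_rule` is made up for illustration and is not part of pandas.

```python
def apply_rule(two_days_ago, one_day_ago, value):
    """Return the cleaned value under the 2x / 0.5x rule (illustrative helper)."""
    prev_mean = (two_days_ago + one_day_ago) / 2
    # Outlier: more than twice, or less than half, the previous two-day mean.
    if value > 2 * prev_mean or value < 0.5 * prev_mean:
        return prev_mean
    return value

cash_sales = apply_rule(105, 75, 1000)  # mean 90.0, 1000 > 180  -> replaced
cash_items = apply_rule(20, 10, 500)    # mean 15.0, 500 > 30    -> replaced
card_sales = apply_rule(190, 170, 150)  # mean 180.0, 90 <= 150 <= 360 -> kept
card_items = apply_rule(40, 35, 20)     # mean 37.5, 18.75 <= 20 <= 75 -> kept

print(cash_sales, cash_items, card_sales, card_items)  # 90.0 15.0 150 20
```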

Here is my attempt (using for loops; it does not work). Whenever I combine loops with pandas, a red flag goes up in my head. What is the right way to do this?

stores = df['Store'].unique()
payment_methods = df['Payment Method'].unique()
attributes = df['Attribute'].unique()

df_no_outliers = pd.DataFrame()

for store in stores:
    for payment_method in payment_methods:
        for attribute in attributes:

            df_temp = df.loc[df['Store'] == store]
            df_temp = df_temp.loc[df_temp['Payment Method'] == payment_method]
            df_temp = df_temp.loc[df_temp['Attribute'] == attribute]

            df_temp['Value'] = np.where(df_temp['Value'] <= (df_temp['Value'].shift(-1)
                                                                +df_temp['Value'].shift(-2))*2/2,
                                         df_temp['Value'],
                                        (df_temp['Value'].shift(-1)+df_temp['Value'].shift(-2))/2)

            df_temp['Value'] = np.where(df_temp['Value'] >= (df_temp['Value'].shift(-1)
                                                                +df_temp['Value'].shift(-2))*0.5/2,
                                         df_temp['Value'],
                                        (df_temp['Value'].shift(-1)+df_temp['Value'].shift(-2))/2)


            df_no_outliers = df_no_outliers.append(df_temp)

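In case anyone is curious why I am using this rolling-average approach rather than something like Tukey's method, which cuts off data more than 1.5*IQR below Q1 or above Q3: my data is a time series spanning COVID, which means the IQR is very large (high sales before COVID, followed by a deep trough of low sales), so the IQR ends up filtering nothing. I do not want to remove the COVID drop itself, only some bad data-entry mistakes (some stores are not good at this and may enter a few extra zeros on some days...). I will probably end up using 5 or 7 days (a week) rather than the last two days as the rolling window. I am also open to other ways of doing this kind of cleaning/outlier removal.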
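To make the IQR point concrete, here is a minimal sketch with made-up numbers (not the real store data): 50 high-volume days followed by 50 low-volume days, with one entry in the low period typed with an extra zero. Under these assumed values, the Tukey fences are wide enough that the bad entry is not flagged.

```python
import pandas as pd

# Hypothetical series: 50 days at 1000, then 50 days at 100.
values = pd.Series([1000.0] * 50 + [100.0] * 50)
values.iloc[75] = 1000.0  # data-entry error: 100 typed as 1000

# Tukey fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR over the whole series.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

flagged = values[(values < lower) | (values > upper)]
print(lower, upper)   # -1250.0 2350.0
print(len(flagged))   # 0, the extra-zero entry sits inside the fences
```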

Answer 1

Try:

#groupby the required columns and compute the rolling 2-day average
average = (df.groupby(["Store","Payment Method","Attribute"], as_index=False)
           .apply(lambda x: x["Value"].rolling(2).mean().shift())
           .droplevel(0)
           )

#divide values by the average and keep only those ratios that fall between 0.5 and 2
output = df[df["Value"].div(average).fillna(1).between(0.5,2)]

>>> output
    Store                 Date Payment Method Attribute  Value
0    5123  2021-01-01 00:00:00           cash    sales$    105
1    5123  2021-01-01 00:00:00           cash     items     20
2    5123  2021-01-01 00:00:00           card    sales$    190
3    5123  2021-01-01 00:00:00           card     items     40
4    5123  2021-01-02 00:00:00           cash    sales$     75
5    5123  2021-01-02 00:00:00           cash     items     10
6    5123  2021-01-02 00:00:00           card    sales$    170
7    5123  2021-01-02 00:00:00           card     items     35
10   5123  2021-01-03 00:00:00           card    sales$    150
11   5123  2021-01-03 00:00:00           card     items     20
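The filter above drops the outlier rows. If the goal is to replace them with the previous average, as the question describes, one possible variant is sketched below. The function name `replace_outliers` and its parameters are assumptions made for this example, and it assumes the rows are already in date order within each group. The `window` parameter allows the 5 or 7 day window mentioned in the question.

```python
import pandas as pd

def replace_outliers(df, keys, value_col="Value", window=2, low=0.5, high=2.0):
    """Replace values outside [low, high] x the mean of the previous
    `window` rows of the same group (illustrative helper, not a pandas API)."""
    out = df.copy()
    # Mean of the previous `window` rows per group; NaN until enough history exists.
    prev_mean = out.groupby(keys)[value_col].transform(
        lambda s: s.rolling(window).mean().shift()
    )
    ratio = out[value_col] / prev_mean
    # Comparisons against NaN are False, so rows without history are kept.
    is_outlier = (ratio > high) | (ratio < low)
    out[value_col] = out[value_col].astype(float).mask(is_outlier, prev_mean)
    return out

# Reduced version of the question's data: sales$ only, rows in date order.
demo = pd.DataFrame(
    {
        "Store": [5123] * 6,
        "Date": ["2021-01-01", "2021-01-01", "2021-01-02",
                 "2021-01-02", "2021-01-03", "2021-01-03"],
        "Payment Method": ["cash", "card"] * 3,
        "Attribute": ["sales$"] * 6,
        "Value": [105, 190, 75, 170, 1000, 150],
    }
)

cleaned = replace_outliers(demo, ["Store", "Payment Method", "Attribute"])
print(cleaned["Value"].tolist())  # [105.0, 190.0, 75.0, 170.0, 90.0, 150.0]
```

Note that the rolling mean here is computed from the original values, so an outlier still feeds into the baseline of the rows that follow it within the window. Whether that matters depends on how often bad entries occur close together.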

Answer 1
0 2021-12-03 22:05:51
