
Python Pandas - Faster Way to Iterate Through Categories in Data While Removing Outliers (Without For Loop)

Suppose I have a dataframe like this:

import pandas as pd
import numpy as np

data = [[5123, '2021-01-01 00:00:00', 'cash','sales$', 105],
        [5123, '2021-01-01 00:00:00', 'cash','items', 20],
        [5123, '2021-01-01 00:00:00', 'card','sales$', 190],
        [5123, '2021-01-01 00:00:00', 'card','items', 40],
        [5123, '2021-01-02 00:00:00', 'cash','sales$', 75],
        [5123, '2021-01-02 00:00:00', 'cash','items', 10],
        [5123, '2021-01-02 00:00:00', 'card','sales$', 170],
        [5123, '2021-01-02 00:00:00', 'card','items', 35],
        [5123, '2021-01-03 00:00:00', 'cash','sales$', 1000],
        [5123, '2021-01-03 00:00:00', 'cash','items', 500],
        [5123, '2021-01-03 00:00:00', 'card','sales$', 150],
        [5123, '2021-01-03 00:00:00', 'card','items', 20]]

columns = ['Store', 'Date', 'Payment Method', 'Attribute', 'Value']

df = pd.DataFrame(data = data, columns = columns)

    Store                 Date Payment Method Attribute  Value
0    5123  2021-01-01 00:00:00           cash    sales$    105
1    5123  2021-01-01 00:00:00           cash     items     20
2    5123  2021-01-01 00:00:00           card    sales$    190
3    5123  2021-01-01 00:00:00           card     items     40
4    5123  2021-01-02 00:00:00           cash    sales$     75
5    5123  2021-01-02 00:00:00           cash     items     10
6    5123  2021-01-02 00:00:00           card    sales$    170
7    5123  2021-01-02 00:00:00           card     items     35
8    5123  2021-01-03 00:00:00           cash    sales$   1000
9    5123  2021-01-03 00:00:00           cash     items    500
10   5123  2021-01-03 00:00:00           card    sales$    150
11   5123  2021-01-03 00:00:00           card     items     20

I would like to filter outliers and replace them with the average value from the preceding 2 days. My "outlier rule" is such: if a value for an attribute/payment method is more than twice as big, or smaller than half as big, as the average value for that attribute/payment method from the preceding two days, then replace that outlier with the average value from the preceding two days. Otherwise, leave the value. In this case, all values should remain except for the $1000 sales and 500 items for 5123/'2021-01-03'/'cash'. Those values should be replaced with $90 for sales, and 15 for items.
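To make the rule concrete: for cash 'sales$' on 2021-01-03, the preceding two-day average is (105 + 75) / 2 = 90, and 1000 > 2 * 90 = 180, so 1000 is an outlier and is replaced by 90. Likewise for cash 'items', (20 + 10) / 2 = 15 and 500 > 2 * 15 = 30, so 500 is replaced by 15.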

Here is my attempt, using nested for loops. Whenever I am using a loop and Pandas together, a red flag goes off in my head. What is the correct, vectorized way to do this?

stores = df['Store'].unique()
payment_methods = df['Payment Method'].unique()
attributes = df['Attribute'].unique()

frames = []

for store in stores:
    for payment_method in payment_methods:
        for attribute in attributes:

            # select the rows for this store / payment method / attribute
            mask = ((df['Store'] == store)
                    & (df['Payment Method'] == payment_method)
                    & (df['Attribute'] == attribute))
            df_temp = df.loc[mask].copy()

            # average of the preceding two days
            prev_avg = (df_temp['Value'].shift(1) + df_temp['Value'].shift(2)) / 2

            # replace values more than twice the preceding average
            df_temp['Value'] = np.where(df_temp['Value'] > prev_avg * 2,
                                        prev_avg, df_temp['Value'])
            # replace values less than half the preceding average
            df_temp['Value'] = np.where(df_temp['Value'] < prev_avg * 0.5,
                                        prev_avg, df_temp['Value'])

            frames.append(df_temp)

df_no_outliers = pd.concat(frames)

In case anyone is curious why I'm using this rolling-average method instead of something like Tukey's fences (cutting off data more than 1.5*IQR below Q1 or above Q3): my data is a time series over the period of COVID, which means the IQR is very large (high sales before COVID, then a deep pit of missing sales after), so an IQR rule ends up filtering nothing. I do not want to remove the COVID drop, but rather some erroneous data-entry failures (some stores are bad about this, and may enter a few extra zeroes on some days...). Instead of using the last two days as a rolling filter, I will probably end up using 5 or 7 days (a week). I am also open to other ways of doing this cleanup / outlier removal.

Try:

# group by the identifying columns and compute each row's average over the
# preceding 2 days (rolling mean, shifted down one row)
average = (df.groupby(["Store","Payment Method","Attribute"], as_index=False)
           .apply(lambda x: x["Value"].rolling(2).mean().shift())
           .droplevel(0)
           )

# divide each value by its preceding average and keep only rows whose ratio
# falls between 0.5 and 2 (fillna(1) keeps the first rows of each group,
# which have no preceding average to compare against)
output = df[df["Value"].div(average).fillna(1).between(0.5, 2)]
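The apply call returns a Series indexed by (group number, original row index); droplevel(0) removes the group level so that average aligns with df's original index. That alignment is what makes the row-by-row ratio df["Value"].div(average) valid.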

>>> output
    Store                 Date Payment Method Attribute  Value
0    5123  2021-01-01 00:00:00           cash    sales$    105
1    5123  2021-01-01 00:00:00           cash     items     20
2    5123  2021-01-01 00:00:00           card    sales$    190
3    5123  2021-01-01 00:00:00           card     items     40
4    5123  2021-01-02 00:00:00           cash    sales$     75
5    5123  2021-01-02 00:00:00           cash     items     10
6    5123  2021-01-02 00:00:00           card    sales$    170
7    5123  2021-01-02 00:00:00           card     items     35
10   5123  2021-01-03 00:00:00           card    sales$    150
11   5123  2021-01-03 00:00:00           card     items     20
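
Note that this keeps only the non-outlier rows. If you instead want to replace the outliers with the preceding average, as the question describes, a minimal sketch reusing the average series above could look like this (ratio, outliers and df_replaced are illustrative names, not part of the original answer):

# flag rows whose ratio to the preceding 2-day average falls outside [0.5, 2]
ratio = df["Value"].div(average).fillna(1)
outliers = ~ratio.between(0.5, 2)

# replace the flagged values with the preceding average
df_replaced = df.copy()
df_replaced["Value"] = df_replaced["Value"].astype(float)  # averages may be fractional
df_replaced.loc[outliers, "Value"] = average

To use a wider window (the 5 or 7 days mentioned in the question), change rolling(2) to rolling(5) or rolling(7) when computing average.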
