
Python Pandas - Faster Way to Iterate Through Categories in Data While Removing Outliers (Without For Loop)

Suppose I have a dataframe like this:

import pandas as pd
import numpy as np

data = [[5123, '2021-01-01 00:00:00', 'cash','sales$', 105],
        [5123, '2021-01-01 00:00:00', 'cash','items', 20],
        [5123, '2021-01-01 00:00:00', 'card','sales$', 190],
        [5123, '2021-01-01 00:00:00', 'card','items', 40],
        [5123, '2021-01-02 00:00:00', 'cash','sales$', 75],
        [5123, '2021-01-02 00:00:00', 'cash','items', 10],
        [5123, '2021-01-02 00:00:00', 'card','sales$', 170],
        [5123, '2021-01-02 00:00:00', 'card','items', 35],
        [5123, '2021-01-03 00:00:00', 'cash','sales$', 1000],
        [5123, '2021-01-03 00:00:00', 'cash','items', 500],
        [5123, '2021-01-03 00:00:00', 'card','sales$', 150],
        [5123, '2021-01-03 00:00:00', 'card','items', 20]]

columns = ['Store', 'Date', 'Payment Method', 'Attribute', 'Value']

df = pd.DataFrame(data = data, columns = columns)

    Store                 Date Payment Method Attribute  Value
0    5123  2021-01-01 00:00:00           cash    sales$    105
1    5123  2021-01-01 00:00:00           cash     items     20
2    5123  2021-01-01 00:00:00           card    sales$    190
3    5123  2021-01-01 00:00:00           card     items     40
4    5123  2021-01-02 00:00:00           cash    sales$     75
5    5123  2021-01-02 00:00:00           cash     items     10
6    5123  2021-01-02 00:00:00           card    sales$    170
7    5123  2021-01-02 00:00:00           card     items     35
8    5123  2021-01-03 00:00:00           cash    sales$   1000
9    5123  2021-01-03 00:00:00           cash     items    500
10   5123  2021-01-03 00:00:00           card    sales$    150
11   5123  2021-01-03 00:00:00           card     items     20

I would like to filter outliers and replace them with the average value from the preceding 2 days. My "outlier rule" is such: if a value for an attribute/payment method is more than twice as big, or smaller than half as big, as the average value for that attribute/payment method from the preceding two days, then replace that outlier with the average value from the preceding two days. Otherwise, leave the value. In this case, all values should remain except for the $1000 sales and 500 items for 5123/'2021-01-03'/'cash'. Those values should be replaced with $90 for sales, and 15 for items.
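To make the rule concrete: for cash 'sales$' on 2021-01-03, the preceding two-day average is (105 + 75) / 2 = 90, and 1000 > 2 * 90 = 180, so 1000 is an outlier and is replaced by 90. Likewise for cash 'items', (20 + 10) / 2 = 15 and 500 > 2 * 15 = 30, so 500 is replaced by 15.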

Here is my attempt, using nested for loops. Whenever I am using a loop and Pandas together, a red flag goes off in my head. What is the correct, vectorized way to do this?

stores = df['Store'].unique()
payment_methods = df['Payment Method'].unique()
attributes = df['Attribute'].unique()

frames = []

for store in stores:
    for payment_method in payment_methods:
        for attribute in attributes:

            # select the rows for this store / payment method / attribute
            mask = ((df['Store'] == store)
                    & (df['Payment Method'] == payment_method)
                    & (df['Attribute'] == attribute))
            df_temp = df.loc[mask].copy()

            # average of the preceding two days
            prev_avg = (df_temp['Value'].shift(1) + df_temp['Value'].shift(2)) / 2

            # replace values more than twice the preceding average
            df_temp['Value'] = np.where(df_temp['Value'] > prev_avg * 2,
                                        prev_avg, df_temp['Value'])
            # replace values less than half the preceding average
            df_temp['Value'] = np.where(df_temp['Value'] < prev_avg * 0.5,
                                        prev_avg, df_temp['Value'])

            frames.append(df_temp)

df_no_outliers = pd.concat(frames)

In case anyone is curious why I'm using this rolling-average method instead of something like Tukey's fences (cutting off data more than 1.5*IQR below Q1 or above Q3): my data is a time series over the period of COVID, which means the IQR is very large (high sales before COVID, then a deep pit of missing sales after), so an IQR rule ends up filtering nothing. I do not want to remove the COVID drop, but rather some erroneous data-entry failures (some stores are bad about this, and may enter a few extra zeroes on some days...). Instead of using the last two days as a rolling filter, I will probably end up using 5 or 7 days (a week). I am also open to other ways of doing this cleanup / outlier removal.

Try:

# group by the identifying columns and compute each row's average over the
# preceding 2 days (rolling mean, shifted down one row)
average = (df.groupby(["Store","Payment Method","Attribute"], as_index=False)
           .apply(lambda x: x["Value"].rolling(2).mean().shift())
           .droplevel(0)
           )

# divide each value by its preceding average and keep only rows whose ratio
# falls between 0.5 and 2 (fillna(1) keeps the first rows of each group,
# which have no preceding average to compare against)
output = df[df["Value"].div(average).fillna(1).between(0.5, 2)]
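The apply call returns a Series indexed by (group number, original row index); droplevel(0) removes the group level so that average aligns with df's original index. That alignment is what makes the row-by-row ratio df["Value"].div(average) valid.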

>>> output
    Store                 Date Payment Method Attribute  Value
0    5123  2021-01-01 00:00:00           cash    sales$    105
1    5123  2021-01-01 00:00:00           cash     items     20
2    5123  2021-01-01 00:00:00           card    sales$    190
3    5123  2021-01-01 00:00:00           card     items     40
4    5123  2021-01-02 00:00:00           cash    sales$     75
5    5123  2021-01-02 00:00:00           cash     items     10
6    5123  2021-01-02 00:00:00           card    sales$    170
7    5123  2021-01-02 00:00:00           card     items     35
10   5123  2021-01-03 00:00:00           card    sales$    150
11   5123  2021-01-03 00:00:00           card     items     20
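
Note that this keeps only the non-outlier rows. If you instead want to replace the outliers with the preceding average, as the question describes, a minimal sketch reusing the average series above could look like this (ratio, outliers and df_replaced are illustrative names, not part of the original answer):

# flag rows whose ratio to the preceding 2-day average falls outside [0.5, 2]
ratio = df["Value"].div(average).fillna(1)
outliers = ~ratio.between(0.5, 2)

# replace the flagged values with the preceding average
df_replaced = df.copy()
df_replaced["Value"] = df_replaced["Value"].astype(float)  # averages may be fractional
df_replaced.loc[outliers, "Value"] = average

To use a wider window (the 5 or 7 days mentioned in the question), change rolling(2) to rolling(5) or rolling(7) when computing average.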
