
Aggregating on the aggregated values in pandas returns wrong result

I have a dataframe in which each transaction can appear more than once, and each transaction is linked to a store. I want to find the mean transaction value per store. For that I first need to sum the line items of each transaction and then take the mean:

# preparing the dataset
import pandas as pd

txt_data = pd.read_csv("./TestDataSource/txn.csv", sep=';')
txt_data = txt_data.replace({',': '.'}, regex=True)
txt_data[['SALES']] = txt_data[['SALES']].apply(pd.to_numeric)
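As an aside, the string replace + `pd.to_numeric` steps can be avoided entirely, because `read_csv` accepts a `decimal` parameter for comma decimal separators. A minimal sketch (using an inline stand-in for `txn.csv`, since the real file is not available here):

```python
import io
import pandas as pd

# Tiny stand-in for txn.csv; the real file uses ';' as separator
# and ',' as the decimal mark, e.g. "60,2" meaning 60.2.
csv = "STORE;TXN;SALES\n1082;5541359000;60,2\n1082;5541359000;15,6\n"

# decimal=',' makes read_csv parse SALES as float directly.
txt_data = pd.read_csv(io.StringIO(csv), sep=';', decimal=',')
print(txt_data['SALES'].dtype)  # float64
```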


The data contains only 30 unique stores: `len(txt_data.STORE.unique())` returns 30.

First of all I aggregate on transactions:

a1 = txt_data[['STORE', 'SALES', 'TXN']].groupby('TXN').sum()[['STORE', 'SALES']]
a1.head()


Everything seems to be okay. But then I aggregate on stores:

a2 = a1.groupby('STORE').mean()

But `list(a2.shape)` returns `[1137, 1]`. That's really confusing. Moreover, `len(a1.STORE.unique())` returns 1137.

What am I doing wrong?

The problem is that you aggregate both the STORE and SALES columns with sum per TXN group:

a1 = txt_data[['STORE', 'SALES', 'TXN']].groupby('TXN').sum()[['STORE', 'SALES']]

Which is the same as:

a1 = txt_data.groupby('TXN')[['STORE', 'SALES']].sum()
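Note that selecting multiple columns after `groupby` requires a list of labels in current pandas versions (the bare `['STORE', 'SALES']` tuple form was deprecated and later removed). A minimal sketch with toy data:

```python
import pandas as pd

# Toy data: one store, two transactions.
df = pd.DataFrame({
    'TXN':   [1, 1, 2],
    'STORE': [10, 10, 20],
    'SALES': [5.0, 7.0, 3.0],
})

# Double brackets select a sub-DataFrame of the grouped columns;
# both STORE and SALES are summed within each TXN group.
a1 = df.groupby('TXN')[['STORE', 'SALES']].sum()
print(a1)
```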

But if you aggregate by both the TXN and STORE columns, everything works nicely:

txt_data = pd.read_csv("txn.csv", sep = ';', decimal=',')

a1 = txt_data.groupby(['TXN', 'STORE'], as_index=False)['SALES'].sum()

print (txt_data.STORE.nunique())
30

print (a1.STORE.nunique())
30
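From here, the per-store mean transaction value the question asks for falls out of one more `groupby`. A sketch with toy data (the original CSV is not available, so the numbers are illustrative):

```python
import pandas as pd

# Toy data standing in for txn.csv: two stores, three transactions.
txt_data = pd.DataFrame({
    'STORE': [1082, 1082, 1082, 956, 956],
    'TXN':   [1, 1, 2, 3, 3],
    'SALES': [60.2, 15.6, 4.0, 10.0, 20.0],
})

# Step 1: sum line items per (TXN, STORE) to get one total per transaction.
a1 = txt_data.groupby(['TXN', 'STORE'], as_index=False)['SALES'].sum()

# Step 2: average the transaction totals per store.
a2 = a1.groupby('STORE')['SALES'].mean()
print(a2)
```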

In the line

a1 = txt_data[['STORE', 'SALES', 'TXN']].groupby('TXN').sum()[['STORE', 'SALES']]

you are grouping your dataframe by TXN but telling pandas to sum all other columns, so the store ids get summed and "new stores" are created, e.g.:

txt_data[txt_data['TXN']==5541359000]  

               DAY  STORE   ART                    TXN      TIME    SALES
1268877 2015-10-01  1082    15294488        5541359000  09:30:22    60.2
1269093 2015-10-01  1082    80439           5541359000  09:30:29    15.6
1269309 2015-10-01  1082    191452          5541359000  09:30:15    4.0
1269525 2015-10-01  1082    15317962        5541359000  09:30:17    103.0

a1.head()
           STORE    SALES
TXN     
5541359000  4328    182.8

#1082 * 4 = 4328
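The effect shown above can be reproduced with a few lines of toy data:

```python
import pandas as pd

# Four line items of one transaction, all from store 1082.
df = pd.DataFrame({
    'STORE': [1082, 1082, 1082, 1082],
    'TXN':   [5541359000] * 4,
    'SALES': [60.2, 15.6, 4.0, 103.0],
})

# groupby('TXN').sum() treats STORE as just another numeric column:
# the four identical store ids are added up, 1082 * 4 = 4328.
a1 = df.groupby('TXN').sum()
print(a1)
```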

I think your problem happened when you used this line:

a1 = txt_data[['STORE', 'SALES', 'TXN']].groupby('TXN').sum()

When I get the unique values using `txt_data['STORE'].unique()`, it outputs:

array([22691, 20581,  1574,  1602,  1579, 29245, 19009, 21761, 17474,
        1544,  1612,  1534,   958, 17096,  1094,  1596,  1594,  1609,
       24605,   956,   961,  1122, 27220,   974,  1082, 25039,  1530,
         999,  1053,   980])

But in the a1 dataframe, the STORE values are different from those in txt_data, because `groupby(...).sum()` summed the STORE values within each unique TXN.

See: there is no STORE = 4328 in `txt_data['STORE'].unique()`.


1082 * 4 = 4328
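If you want to keep the single-`groupby('TXN')` shape, another option (not shown in the answers above, so treat it as an alternative sketch) is to pass a dict to `agg` so that STORE, which is constant within a transaction, is carried through with `'first'` while only SALES is summed:

```python
import pandas as pd

# Toy data: two transactions from two stores.
df = pd.DataFrame({
    'STORE': [1082, 1082, 956],
    'TXN':   [1, 1, 2],
    'SALES': [60.2, 15.6, 4.0],
})

# Keep the (constant-per-TXN) store id with 'first'; sum only SALES.
a1 = df.groupby('TXN').agg({'STORE': 'first', 'SALES': 'sum'})
print(a1)
```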
