My question is somewhat similar to this one, but not quite. I have a CSV with the following kind of structure:
| id | entrydate | sales | purchases |
| -- | -----------| ----- | --------- |
| 1 | 05/03/2017 | 10 | 1 |
| 2 | 05/03/2017 | 20 | 2 |
| 3 | 05/03/2017 | 30 | 3 |
| 1 | 05/03/2017 | 40 | 1 |
I'm reading this into a dataframe, and I want to get daily aggregates of sales and purchases (individual id doesn't matter, just daily aggregates).
First, however, I need to remove duplicates. This is tripping me up: in the example above, id 1 has two entries on the same day, and multiple entries in the `purchases` column are to be considered duplicates, whereas multiple entries in the `sales` column are valid. The correct grouping would therefore result in
| id | entrydate | sales | purchases |
| -- | -----------| ----- | --------- |
| 1 | 05/03/2017 | 50 | 1 |
| 2 | 05/03/2017 | 20 | 2 |
| 3 | 05/03/2017 | 30 | 3 |
and then getting the daily aggregate would give me
|entrydate | sales | purchases |
| -----------| ----- | --------- |
| 05/03/2017 | 100 | 6 |
I was trying to remove the `purchases` duplicates using
```python
import pandas as pd

df = (pd.read_csv('../my-csv.csv', parse_dates=True, dayfirst=True,
                  usecols=my_columns, dtype=my_dtypes)
        .rename(columns=str.lower)
        .assign(date=lambda x: pd.to_datetime(x['entrydate'], format="%d/%m/%Y"))
        .set_index('date'))
df = df.drop_duplicates(['id', 'entrydate', 'purchases'])
df.drop(['id'], axis=1, inplace=True)
df = df.groupby(pd.Grouper(freq='D')).sum()  # pd.TimeGrouper is deprecated
```
but while this removes the duplicate `purchases` rows, it also removes valid `sales` entries.
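For reference, the intended two-step logic (collapse duplicate `purchases` per id and day, then sum per day) can be sketched on the sample data above. Column names follow the question; keeping the *first* purchase value per group is an assumption about how the duplicates should be resolved:

```python
import pandas as pd

# Sample data from the question's table.
df = pd.DataFrame({
    'id':        [1, 2, 3, 1],
    'entrydate': ['05/03/2017'] * 4,
    'sales':     [10, 20, 30, 40],
    'purchases': [1, 2, 3, 1],
})

# Per (id, entrydate): all sales rows are valid, so sum them;
# repeated purchases are duplicates, so keep a single value per group.
per_id = df.groupby(['id', 'entrydate'], as_index=False).agg(
    sales=('sales', 'sum'), purchases=('purchases', 'first'))

# Daily aggregate: sum across ids.
daily = per_id.groupby('entrydate', as_index=False)[['sales', 'purchases']].sum()
print(daily)  # sales 100, purchases 6
```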
If you groupby entrydate you can aggregate both sales and purchases:
```
In [11]: df.groupby("entrydate").agg({"sales": "sum", "purchases": "sum"})
Out[11]:
            sales  purchases
entrydate
05/03/2017    100          7
```
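For completeness, this answer can be reproduced on the question's sample data; because it sums `purchases` directly, the duplicate row for id 1 is counted, giving 7 rather than the deduplicated 6 the question asks for:

```python
import pandas as pd

# Sample data from the question's table.
df = pd.DataFrame({'id': [1, 2, 3, 1],
                   'entrydate': ['05/03/2017'] * 4,
                   'sales': [10, 20, 30, 40],
                   'purchases': [1, 2, 3, 1]})

agg = df.groupby('entrydate').agg({'sales': 'sum', 'purchases': 'sum'})
print(agg)  # sales 100, purchases 7
```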
You can use `groupby` twice: first to aggregate `sales` per id, then to aggregate by day:
```python
df.sales = df.groupby('id').sales.transform('sum')
df = df.drop_duplicates()
df.groupby(df.entrydate).sum().reset_index()
```

```
    entrydate  sales  purchases
0  2017-05-03    100          6
```
EDIT: To account for summing over different dates (using a parsed `date` column):

```python
df.sales = df.groupby(['id', 'date']).sales.transform('sum')
df = df.drop_duplicates()
df.groupby('date')[['sales', 'purchases']].sum().reset_index()
```
You get

```
         date  sales  purchases
0  2017-03-05    100          6
1  2017-03-06     40          1
```
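A self-contained sketch of this edited approach, using illustrative two-day data (the question's day plus one extra row on a second day; `date` is kept as a plain string column here for simplicity):

```python
import pandas as pd

# Illustrative data: the question's rows plus one row on a second day.
df = pd.DataFrame({'id':        [1, 2, 3, 1, 1],
                   'date':      ['2017-03-05'] * 4 + ['2017-03-06'],
                   'sales':     [10, 20, 30, 40, 40],
                   'purchases': [1, 2, 3, 1, 1]})

# Sum sales per (id, date) so every row for that id/day carries the total...
df['sales'] = df.groupby(['id', 'date'])['sales'].transform('sum')
# ...which turns same-day duplicates into identical rows we can drop.
df = df.drop_duplicates()

out = df.groupby('date')[['sales', 'purchases']].sum().reset_index()
print(out)  # 2017-03-05: sales 100, purchases 6; 2017-03-06: sales 40, purchases 1
```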
Setup
```python
df = pd.DataFrame({'entrydate': {0: '05/03/2017', 1: '05/03/2017', 2: '05/03/2017',
                                 3: '05/03/2017', 4: '06/03/2017', 5: '06/03/2017',
                                 6: '06/03/2017', 7: '06/03/2017'},
                   'id': {0: 1, 1: 2, 2: 3, 3: 1, 4: 1, 5: 2, 6: 3, 7: 1},
                   'purchases': {0: 1, 1: 2, 2: 3, 3: 1, 4: 1, 5: 2, 6: 3, 7: 1},
                   'sales': {0: 10, 1: 20, 2: 30, 3: 40, 4: 10, 5: 20, 6: 30, 7: 40}})
```
Solution
```python
# First group by entrydate and id, summing sales and taking the max of
# purchases (removing duplicates); then group again to sum sales and purchases.
df.groupby(['entrydate', 'id']).agg({'sales': 'sum', 'purchases': 'max'}) \
  .groupby(level=0).sum().reset_index()
```

```
    entrydate  purchases  sales
0  05/03/2017          6    100
1  06/03/2017          6    100
```