
Pandas drop only certain column values when trying to remove duplicates

My question is somewhat similar to this one, but not quite. I have a CSV with the following kind of structure:

| id | entrydate  | sales | purchases |
| -- | -----------| ----- | --------- |
| 1  | 05/03/2017 | 10    | 1         |
| 2  | 05/03/2017 | 20    | 2         |
| 3  | 05/03/2017 | 30    | 3         |
| 1  | 05/03/2017 | 40    | 1         |

I'm reading this into a dataframe, and I want to get daily aggregates of sales and purchases (individual id doesn't matter, just daily aggregates).

First, however, I need to remove duplicates. This is tripping me up because, taking the example above, id 1 has two entries on the same day; repeated values in the purchases column are to be considered duplicates, whereas repeated values in the sales column are valid. The correct de-duplication would therefore result in

| id | entrydate  | sales | purchases |
| -- | -----------| ----- | --------- |
| 1  | 05/03/2017 | 50    | 1         |
| 2  | 05/03/2017 | 20    | 2         |
| 3  | 05/03/2017 | 30    | 3         |

and then getting the daily aggregate would give me

|entrydate   | sales | purchases |
| -----------| ----- | --------- |
| 05/03/2017 | 100   | 6         |
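The two steps described above (de-duplicate purchases within each id and day, then aggregate per day) can be sketched with two groupby passes. This is a minimal sketch over an inline copy of the sample table, not the asker's actual CSV:

```python
import pandas as pd

# Sample data mirroring the table above
df = pd.DataFrame({
    'id': [1, 2, 3, 1],
    'entrydate': ['05/03/2017'] * 4,
    'sales': [10, 20, 30, 40],
    'purchases': [1, 2, 3, 1],
})

# Step 1: within each (entrydate, id) pair, sum sales but keep a single
# purchases value ('max' picks one of the identical duplicates).
per_id = df.groupby(['entrydate', 'id'], as_index=False).agg(
    {'sales': 'sum', 'purchases': 'max'})

# Step 2: aggregate per day.
daily = per_id.groupby('entrydate', as_index=False)[['sales', 'purchases']].sum()
print(daily)  # sales 100, purchases 6 for 05/03/2017
```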

I was trying to remove the purchases duplicates using

import pandas as pd

df = (
    pd.read_csv('../my-csv.csv', parse_dates=True, dayfirst=True,
                usecols=my_columns, dtype=my_dtypes)
      .rename(columns=str.lower)
      .assign(date=lambda x: pd.to_datetime(x['entrydate'], format="%d/%m/%Y"))
      .set_index('date')
)


df = df.drop_duplicates(['id', 'entrydate', 'purchases'])  # drop repeated purchases
df.drop(['id'], axis=1, inplace=True)                      # id no longer needed
df = df.groupby(pd.TimeGrouper(freq='D')).sum()            # daily totals (pd.Grouper(freq='D') in modern pandas)

but while this removes the duplicate purchases, it also drops valid sales: the second id 1 row is discarded because its (id, entrydate, purchases) combination already appeared, taking its sales value of 40 with it.
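The data loss can be seen concretely in a minimal reproduction (hypothetical inline data standing in for the CSV):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 1],
    'entrydate': ['05/03/2017'] * 4,
    'sales': [10, 20, 30, 40],
    'purchases': [1, 2, 3, 1],
})

# The second id 1 row matches the first on (id, entrydate, purchases),
# so drop_duplicates discards it -- along with its sales value of 40.
deduped = df.drop_duplicates(['id', 'entrydate', 'purchases'])
print(deduped['sales'].sum())  # 60, not the expected 100
```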




If you groupby entrydate you can aggregate both sales and purchases (note that without de-duplicating first, the repeated purchase row is counted twice, hence 7 below rather than the expected 6):

In [11]: df.groupby("entrydate").agg({"sales": "sum", "purchases": "sum"})
Out[11]:
            sales  purchases
entrydate
05/03/2017    100          7

You can use groupby twice: first to aggregate sales per id, then to aggregate by day.

# Sum sales per id first; the duplicate rows then become identical and can be dropped
df.sales = df.groupby('id').sales.transform('sum')
df = df.drop_duplicates()
df.groupby(df.entrydate).sum().reset_index()


    entrydate   sales   purchases
0   2017-05-03  100     6

EDIT: to account for sums over different dates:

df.sales = df.groupby(['id', 'date']).sales.transform('sum')
df = df.drop_duplicates()
df.groupby('date')[['sales', 'purchases']].sum().reset_index()

You get

    date        sales   purchases
0   2017-03-05  100     6
1   2017-03-06  40      1

Setup

df = pd.DataFrame({'entrydate': {0: '05/03/2017',
  1: '05/03/2017',
  2: '05/03/2017',
  3: '05/03/2017',
  4: '06/03/2017',
  5: '06/03/2017',
  6: '06/03/2017',
  7: '06/03/2017'},
 'id': {0: 1, 1: 2, 2: 3, 3: 1, 4: 1, 5: 2, 6: 3, 7: 1},
 'purchases': {0: 1, 1: 2, 2: 3, 3: 1, 4: 1, 5: 2, 6: 3, 7: 1},
 'sales': {0: 10, 1: 20, 2: 30, 3: 40, 4: 10, 5: 20, 6: 30, 7: 40}})

Solution

# First group by entrydate and id, summing sales and taking the max of
# purchases (removing the duplicates). Then group by entrydate again to
# sum sales and purchases per day.
df.groupby(['entrydate', 'id']).agg({'sales': 'sum', 'purchases': 'max'}).groupby(level=0).sum().reset_index()
Out[431]: 
    entrydate  purchases  sales
0  05/03/2017          6    100
1  06/03/2017          6    100
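On pandas 0.25 and later, the same two-pass idea can be written with named aggregation, which replaces the dict-based agg and gives explicit output column names. A sketch, self-contained with the same data as the Setup block:

```python
import pandas as pd

# Same data as the Setup block above
df = pd.DataFrame({
    'entrydate': ['05/03/2017'] * 4 + ['06/03/2017'] * 4,
    'id': [1, 2, 3, 1] * 2,
    'purchases': [1, 2, 3, 1] * 2,
    'sales': [10, 20, 30, 40] * 2,
})

result = (
    df.groupby(['entrydate', 'id'])
      .agg(sales=('sales', 'sum'), purchases=('purchases', 'max'))
      .groupby(level='entrydate')
      .sum()
      .reset_index()
)
print(result)  # sales 100 and purchases 6 for each date
```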
