
Pandas drop only certain column values when trying to remove duplicates

My question is somewhat similar to this one, but not quite. I have a CSV with the following kind of structure:

| id | entrydate  | sales | purchases |
| -- | -----------| ----- | --------- |
| 1  | 05/03/2017 | 10    | 1         |
| 2  | 05/03/2017 | 20    | 2         |
| 3  | 05/03/2017 | 30    | 3         |
| 1  | 05/03/2017 | 40    | 1         |

I'm reading this into a dataframe, and I want to get daily aggregates of sales and purchases (the individual id doesn't matter, just the daily aggregates).

First, however, I need to remove duplicates. This is tripping me up: in the example above, id 1 has two entries on the same day. Multiple entries in the purchases column are to be considered duplicates, whereas multiple entries in the sales column are valid, so the correct grouping would result in

| id | entrydate  | sales | purchases |
| -- | -----------| ----- | --------- |
| 1  | 05/03/2017 | 50    | 1         |
| 2  | 05/03/2017 | 20    | 2         |
| 3  | 05/03/2017 | 30    | 3         |

and then getting the daily aggregate would give me

|entrydate   | sales | purchases |
| -----------| ----- | --------- |
| 05/03/2017 | 100   | 6         |

I was trying to remove the purchases duplicates using

df = (
    pd.read_csv('../my-csv.csv', parse_dates=True, dayfirst=True, usecols=my_columns, dtype=my_dtypes)
    .rename(columns=str.lower)
    .assign(date=lambda x: pd.to_datetime(x['entrydate'], format="%d/%m/%Y"))
    .set_index('date')
)


df = df.drop_duplicates(['id', 'entrydate', 'purchases'])
df.drop(['id'], axis=1, inplace=True)
df = df.groupby(pd.Grouper(freq='D')).sum()  # pd.TimeGrouper was removed from pandas; pd.Grouper replaces it

but while this removes the duplicate purchases, it also removes valid sales.
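To make the problem concrete, here is a minimal sketch using only the four sample rows from the table above (the read_csv step and the date index are left out for brevity). drop_duplicates discards whole rows, so the second row for id 1 is lost together with its sales value of 40.

```python
import pandas as pd

# The four sample rows from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 1],
    "entrydate": ["05/03/2017"] * 4,
    "sales": [10, 20, 30, 40],
    "purchases": [1, 2, 3, 1],
})

# drop_duplicates keeps the first row for each (id, entrydate, purchases)
# key and drops every later row entirely, including its sales value.
deduped = df.drop_duplicates(["id", "entrydate", "purchases"])

sales_total = int(deduped["sales"].sum())
purchases_total = int(deduped["purchases"].sum())
print(sales_total, purchases_total)  # 60 6
```

Purchases come out at the wanted 6, but sales total 60 instead of 100.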



Image for the solution by A-Za-z (image not available)

If you groupby entrydate you can aggregate both sales and purchases:

In [11]: df.groupby("entrydate").agg({"sales": "sum", "purchases": "sum"})
Out[11]:
            sales  purchases
entrydate
05/03/2017    100          7
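Note that the purchases total here is 7 rather than the 6 the question asks for, because the repeated purchase for id 1 is counted twice. One possible variation, sketched below on the question's sample rows, is to aggregate the two columns separately: sum sales over all rows, and sum purchases only after dropping repeated (id, entrydate, purchases) rows. This is an illustrative sketch, not part of the original answer.

```python
import pandas as pd

# The four sample rows from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 1],
    "entrydate": ["05/03/2017"] * 4,
    "sales": [10, 20, 30, 40],
    "purchases": [1, 2, 3, 1],
})

# Every sales entry is valid, so sum them over all rows.
sales = df.groupby("entrydate")["sales"].sum()

# Repeated purchases for the same id and day count once, so drop the
# repeats before summing this column only.
purchases = (
    df.drop_duplicates(["id", "entrydate", "purchases"])
      .groupby("entrydate")["purchases"]
      .sum()
)

# Both Series are indexed by entrydate, so they align column-wise.
daily = pd.concat([sales, purchases], axis=1)
print(daily)
```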

You can use groupby twice, first to aggregate sales:

df.sales = df.groupby('id').sales.transform('sum')
df = df.drop_duplicates()
df.groupby(df.entrydate).sum().reset_index()


    entrydate   sales   purchases
0   2017-05-03  100     6

EDIT: To account for sums over different dates

df.sales = df.groupby(['id', 'date']).sales.transform('sum')
df = df.drop_duplicates()
df.groupby('date')[['sales', 'purchases']].sum().reset_index()

You get

    date        sales   purchases
0   2017-03-05  100     6
1   2017-03-06  40      1

Setup

df = pd.DataFrame({'entrydate': {0: '05/03/2017',
  1: '05/03/2017',
  2: '05/03/2017',
  3: '05/03/2017',
  4: '06/03/2017',
  5: '06/03/2017',
  6: '06/03/2017',
  7: '06/03/2017'},
 'id': {0: 1, 1: 2, 2: 3, 3: 1, 4: 1, 5: 2, 6: 3, 7: 1},
 'purchases': {0: 1, 1: 2, 2: 3, 3: 1, 4: 1, 5: 2, 6: 3, 7: 1},
 'sales': {0: 10, 1: 20, 2: 30, 3: 40, 4: 10, 5: 20, 6: 30, 7: 40}})

Solution

# First group by entrydate and id, summing sales and taking the max of purchases (which removes the duplicates).
# Then group by entrydate again to sum sales and purchases per day.
df.groupby(['entrydate','id']).agg({'sales':'sum', 'purchases':'max'}).groupby(level=0).sum().reset_index()
Out[431]: 
    entrydate  purchases  sales
0  05/03/2017          6    100
1  06/03/2017          6    100
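The same two-stage idea can also be written with named aggregation, as in the sketch below. It rebuilds the Setup frame with plain lists, and it parses entrydate into a datetime column called date (an assumption borrowed from the question's own code) so that days sort chronologically. Taking the max only removes duplicates correctly if repeated rows for an id on a given day carry the same purchases value, as they do in this data.

```python
import pandas as pd

# Same data as the Setup above, written with plain lists
df = pd.DataFrame({
    "entrydate": ["05/03/2017"] * 4 + ["06/03/2017"] * 4,
    "id": [1, 2, 3, 1, 1, 2, 3, 1],
    "purchases": [1, 2, 3, 1, 1, 2, 3, 1],
    "sales": [10, 20, 30, 40, 10, 20, 30, 40],
})

# Parse day-first strings so grouping and sorting follow real dates.
df["date"] = pd.to_datetime(df["entrydate"], format="%d/%m/%Y")

# Stage 1: one row per (date, id); sales add up, purchases collapse to one value.
per_id = df.groupby(["date", "id"], as_index=False).agg(
    sales=("sales", "sum"),
    purchases=("purchases", "max"),
)

# Stage 2: daily totals across ids.
daily = per_id.groupby("date", as_index=False)[["sales", "purchases"]].sum()
print(daily)
```

Each of the two days ends up with sales of 100 and purchases of 6, matching the output above.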
