Drop duplicates in a dataframe, keeping the first or the non-empty value
I have a dataframe in which some dates appear twice for the same name. For each name I want to drop the duplicated dates and keep, from each pair of duplicates, the first entry, or the entry whose value is filled in. For example:
01/02/19 Paolo 42
01/02/19 Paolo 9
The first one is kept: 01/02/19 Paolo 42.
01/02/19 Frank
01/02/19 Frank 30
The second one is kept: 01/02/19 Frank 30.
When I use drop_duplicates, it removes almost everything and keeps only a small set of rows.
My code looks like the following:
import numpy as np
import pandas as pd
path = 'path'
filename = 'Dummy_File_Test.xlsx'
final_path = path + '/' + filename
print(final_path)
ws_name = 'Sheet1'
df = pd.read_excel(final_path, sheet_name=ws_name)
df.fillna('', inplace=True)
df.drop_duplicates(subset =['Date'], keep = 'first', inplace = True, ignore_index=False)
print(df)
The data looks like the following:
Date Name Revenue
01/01/19 Paolo 9
01/02/19 Paolo 42
01/02/19 Paolo 9
01/03/19 Paolo 10
01/04/19 Paolo 38
01/05/19 Paolo
01/06/19 Paolo
01/07/19 Paolo 41
01/08/19 Paolo
01/09/19 Paolo 20
01/10/19 Paolo
01/11/19 Paolo 3
01/12/19 Paolo 2
01/01/19 Frank 9
01/02/19 Frank
01/02/19 Frank 30
01/03/19 Frank 10
01/04/19 Frank
01/05/19 Frank
01/06/19 Frank
01/06/19 Frank
01/07/19 Frank
01/08/19 Frank
01/08/19 Frank
01/09/19 Frank
01/10/19 Frank
01/10/19 Frank 48
01/11/19 Frank 22
01/11/19 Frank
01/12/19 Frank 47
01/01/19 Emilia
01/02/19 Emilia 12
01/02/19 Emilia 15
01/03/19 Emilia 23
01/04/19 Emilia 25
01/05/19 Emilia
01/05/19 Emilia 39
01/06/19 Emilia 30
01/06/19 Emilia 24
01/07/19 Emilia 4
01/08/19 Emilia
01/08/19 Emilia 49
01/09/19 Emilia 24
01/10/19 Emilia
01/11/19 Emilia 12
01/12/19 Emilia 33
The output should look like the following:
Date Name Revenue
01/01/19 Paolo 9
01/02/19 Paolo 42
01/03/19 Paolo 10
01/04/19 Paolo 38
01/05/19 Paolo
01/06/19 Paolo
01/07/19 Paolo 41
01/08/19 Paolo
01/09/19 Paolo 20
01/10/19 Paolo
01/11/19 Paolo 3
01/12/19 Paolo 2
01/01/19 Frank 9
01/02/19 Frank 30
01/03/19 Frank 10
01/04/19 Frank
01/05/19 Frank
01/06/19 Frank
01/07/19 Frank
01/08/19 Frank
01/09/19 Frank
01/10/19 Frank 48
01/11/19 Frank 22
01/12/19 Frank 47
01/01/19 Emilia
01/02/19 Emilia 12
01/03/19 Emilia 23
01/04/19 Emilia 25
01/05/19 Emilia 39
01/06/19 Emilia 30
01/07/19 Emilia 4
01/08/19 Emilia 49
01/09/19 Emilia 24
01/10/19 Emilia
01/11/19 Emilia 12
01/12/19 Emilia 33
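A likely reason the call in the question removes so many rows is that `subset=['Date']` compares dates only, so rows belonging to different names count as duplicates of each other. The sketch below illustrates this on a small hypothetical sample in the question's layout (not the full data set); it only shows the effect of the `subset` argument and does not yet handle empty values.

```python
import pandas as pd

# Small hypothetical sample in the question's layout (not the full data set).
df = pd.DataFrame({
    "Date": ["01/01/19", "01/02/19", "01/02/19", "01/01/19", "01/02/19"],
    "Name": ["Paolo", "Paolo", "Paolo", "Frank", "Frank"],
    "Revenue": [9, 42, 9, 9, 30],
})

# Deduplicating on Date alone keeps one row per date across ALL names,
# so Frank's rows are dropped as "duplicates" of Paolo's.
by_date = df.drop_duplicates(subset=["Date"], keep="first")

# Deduplicating on Date and Name keeps one row per date for each name.
by_date_and_name = df.drop_duplicates(subset=["Date", "Name"], keep="first")

print(by_date)
print(by_date_and_name)
```

With both columns in `subset`, every name keeps its own dates, but `keep='first'` still takes the first row even when its value is empty.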
Please note the change in column names {Date: date, Name: name, Revenue: value}, because I generated my own data.
Coerce date to datetime and set it (Date) as the index:
df['Date']=pd.to_datetime(df['date'])
df.set_index(df['Date'], inplace=True)
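As a minimal sketch of this step on a few hypothetical rows: the strings such as 01/02/19 are ambiguous, so the example assumes they mean month/day/two-digit year and passes an explicit `format`. If the data is actually day/month, the format string would need to be `"%d/%m/%y"` instead.

```python
import pandas as pd

# Hypothetical rows using the answer's column names.
df = pd.DataFrame({
    "date": ["01/01/19", "01/02/19", "01/12/19"],
    "name": ["Paolo", "Paolo", "Paolo"],
    "value": [9.0, 42.0, 2.0],
})

# Assumption: month/day/two-digit year. An explicit format avoids guessing.
df["Date"] = pd.to_datetime(df["date"], format="%m/%d/%y")

# drop=False keeps Date as a column as well as the index,
# which matches the effect of set_index(df['Date']).
df = df.set_index("Date", drop=False)

print(df.index)
```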
Sort by date and name ascending and by value descending, so that the highest value comes first within each date and name:
df.sort_values(by=['date','name','value'],ascending=[True, True, False], inplace=True)
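The effect of this sort on empty cells can be checked on a small hypothetical sample. The sketch assumes empty values are stored as NaN (that is, `fillna('')` has not been applied); `sort_values` then places them last by default, below any filled value in the same group. Note that descending order puts the largest value on top, which is not necessarily the row that came first in the file.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; NaN stands for an empty value cell.
df = pd.DataFrame({
    "date": ["01/02/19", "01/02/19", "01/02/19", "01/02/19"],
    "name": ["Frank", "Frank", "Emilia", "Emilia"],
    "value": [np.nan, 30.0, 12.0, 15.0],
})

# NaN values are sorted last by default (na_position='last'),
# regardless of the ascending/descending flag for that column.
ordered = df.sort_values(by=["date", "name", "value"],
                         ascending=[True, True, False])

print(ordered)
```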
Group by date and name, keeping only the first value in each group:
df.groupby([df.index.date, df.name])['value'].first()
To convert the result back to a dataframe:
df.groupby([df.index.date, df.name])['value'].first().to_frame()
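One caveat: sorting values in descending order before taking the first row keeps the largest value per group, which for Emilia on 01/02/19 would be 15 rather than the 12 shown in the expected output. A possible alternative, sketched below under the assumption that empty cells are NaN (so `fillna('')` is skipped or applied afterwards), relies on `GroupBy.first()` returning the first non-null value and on `sort=False` keeping the groups in their original order. The sample rows are a hypothetical excerpt using the question's column names.

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt of the question's data; NaN marks an empty Revenue cell.
df = pd.DataFrame({
    "Date": ["01/02/19", "01/02/19", "01/02/19", "01/02/19",
             "01/02/19", "01/02/19", "01/06/19", "01/06/19"],
    "Name": ["Paolo", "Paolo", "Frank", "Frank",
             "Emilia", "Emilia", "Frank", "Frank"],
    "Revenue": [42, 9, np.nan, 30, 12, 15, np.nan, np.nan],
})

# first() skips NaN, so each (Date, Name) group yields its first filled value;
# a group with no filled value stays NaN. sort=False preserves row order.
result = (
    df.groupby(["Date", "Name"], sort=False)["Revenue"]
      .first()
      .reset_index()
)

print(result)
```

If empty strings are needed in the final table, `result.fillna('')` can be applied after the grouping; an empty string is not treated as missing, so filling before the grouping would make `first()` keep the blank rows.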