
Drop duplicates in a dataframe, keeping the first or the non-empty value

I have a dataframe and want to drop rows where the same date appears more than once for the same name. From each set of duplicates I want to keep the first entry, or the entry whose value is filled in if the first one is empty, e.g.:

01/02/19    Paolo   42
01/02/19    Paolo   9

The first one is kept: 01/02/19 Paolo 42.

01/02/19    Frank   
01/02/19    Frank   30

The second one is kept: 01/02/19 Frank 30.

When I use drop_duplicates, it removes almost everything and keeps just a small set of rows.

My code looks like the following:

import numpy as np
import pandas as pd

path = 'path'
filename = 'Dummy_File_Test.xlsx'
final_path = path + '/' + filename
print(final_path)
ws_name = 'Sheet1'

df = pd.read_excel(final_path, sheet_name=ws_name)
df.fillna('', inplace=True)
df.drop_duplicates(subset =['Date'], keep = 'first', inplace = True, ignore_index=False) 
print(df)
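
A likely explanation, offered as a sketch rather than a confirmed diagnosis: with subset=['Date'] the rows are compared on the date alone, so only one row per date survives across all names. The small frame below is made up for illustration and only borrows the column names used above. Adding 'Name' to the subset keeps one row per date and name, although it still keeps the first row even when its value is empty.

```python
import pandas as pd

# Hypothetical miniature of the data, using the question's column names
df = pd.DataFrame({
    "Date": ["01/01/19", "01/02/19", "01/02/19", "01/01/19", "01/02/19"],
    "Name": ["Paolo", "Paolo", "Paolo", "Frank", "Frank"],
    "Revenue": [9, 42, 9, 9, 30],
})

# Comparing on Date alone keeps one row per date, regardless of name
by_date = df.drop_duplicates(subset=["Date"], keep="first")

# Comparing on Date and Name keeps one row per date for each name
by_date_and_name = df.drop_duplicates(subset=["Date", "Name"], keep="first")

print(by_date)
print(by_date_and_name)
```

In this sample the first call leaves only Paolo's rows, which matches the symptom of almost everything being removed.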

The data looks like the following:

Date    Name    Revenue
01/01/19    Paolo   9
01/02/19    Paolo   42
01/02/19    Paolo   9
01/03/19    Paolo   10
01/04/19    Paolo   38
01/05/19    Paolo   
01/06/19    Paolo   
01/07/19    Paolo   41
01/08/19    Paolo   
01/09/19    Paolo   20
01/10/19    Paolo   
01/11/19    Paolo   3
01/12/19    Paolo   2
01/01/19    Frank   9
01/02/19    Frank   
01/02/19    Frank   30
01/03/19    Frank   10
01/04/19    Frank   
01/05/19    Frank   
01/06/19    Frank   
01/06/19    Frank   
01/07/19    Frank   
01/08/19    Frank   
01/08/19    Frank   
01/09/19    Frank   
01/10/19    Frank   
01/10/19    Frank   48
01/11/19    Frank   22
01/11/19    Frank   
01/12/19    Frank   47
01/01/19    Emilia  
01/02/19    Emilia  12
01/02/19    Emilia  15
01/03/19    Emilia  23
01/04/19    Emilia  25
01/05/19    Emilia  
01/05/19    Emilia  39
01/06/19    Emilia  30
01/06/19    Emilia  24
01/07/19    Emilia  4
01/08/19    Emilia  
01/08/19    Emilia  49
01/09/19    Emilia  24
01/10/19    Emilia  
01/11/19    Emilia  12
01/12/19    Emilia  33

The output should look like the following:

Date    Name    Revenue
01/01/19    Paolo   9
01/02/19    Paolo   42
01/03/19    Paolo   10
01/04/19    Paolo   38
01/05/19    Paolo   
01/06/19    Paolo   
01/07/19    Paolo   41
01/08/19    Paolo   
01/09/19    Paolo   20
01/10/19    Paolo   
01/11/19    Paolo   3
01/12/19    Paolo   2
01/01/19    Frank   9
01/02/19    Frank   30
01/03/19    Frank   10
01/04/19    Frank   
01/05/19    Frank   
01/06/19    Frank   
01/07/19    Frank   
01/08/19    Frank   
01/09/19    Frank   
01/10/19    Frank   48
01/11/19    Frank   22
01/12/19    Frank   47
01/01/19    Emilia  
01/02/19    Emilia  12
01/03/19    Emilia  23
01/04/19    Emilia  25
01/05/19    Emilia  39
01/06/19    Emilia  30
01/07/19    Emilia  4
01/08/19    Emilia  49
01/09/19    Emilia  24
01/10/19    Emilia  
01/11/19    Emilia  12
01/12/19    Emilia  33

Please note the change in column names {Date: date, Name: name, Revenue: value}, because I generated my own data.

Coerce date to datetime and set it as the index:

# Parse the date strings into datetimes, then use them as the index
df['Date'] = pd.to_datetime(df['date'])
df.set_index('Date', inplace=True)

Sort by date and name ascending and by value descending, so that the highest value in each group comes first:

df.sort_values(by=['date','name','value'],ascending=[True, True, False], inplace=True)
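
To illustrate what this sort does, here is a small sketch on made-up rows (the frame and its values are assumptions, using the column names date, name and value from this answer). By default sort_values places missing values last, so within each date and name the largest filled value ends up first.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two duplicated (date, name) pairs, one with a gap
df = pd.DataFrame({
    "date": ["01/02/19", "01/02/19", "01/02/19", "01/02/19"],
    "name": ["Paolo", "Paolo", "Frank", "Frank"],
    "value": [9, 42, np.nan, 30],
})

# date and name ascending, value descending; NaN is placed last by default
df_sorted = df.sort_values(by=["date", "name", "value"],
                           ascending=[True, True, False])
print(df_sorted)
```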

Group by date and name, keeping only the first value in each group:

df.groupby([df.index.date, df.name])['value'].first()

If you want to convert the result back to a dataframe:

df.groupby([df.index.date, df.name])['value'].first().to_frame()
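
Putting the steps together, a minimal end-to-end sketch on a made-up sample with the question's column names (Date, Name, Revenue). It relies on GroupBy.first returning the first non-null entry of each group. The value sort is deliberately left out here, which is a variation on the steps above: it keeps the first filled row rather than the largest one, and the two differ for groups such as Emilia on 01/02/19, where the first value (12) is not the largest (15). Empty cells are assumed to stay NaN, so a fillna('') call would have to be skipped.

```python
import numpy as np
import pandas as pd

# Hypothetical sample covering the cases from the question
df = pd.DataFrame({
    "Date": ["01/02/19", "01/02/19", "01/02/19", "01/02/19",
             "01/02/19", "01/02/19", "01/04/19"],
    "Name": ["Paolo", "Paolo", "Frank", "Frank",
             "Emilia", "Emilia", "Frank"],
    "Revenue": [42, 9, np.nan, 30, 12, 15, np.nan],
})

# first() skips NaN within each group; sort=False keeps the original order
result = (
    df.groupby(["Date", "Name"], sort=False)["Revenue"]
      .first()
      .reset_index()
)
print(result)
```

Groups that contain no filled value at all, like Frank on 01/04/19 in this sample, are kept with an empty (NaN) Revenue.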

