
Exclude a specific date based on a condition using pandas

import pandas as pd

df2 = pd.DataFrame({'person_id':[11,11,11,11,11,12,12,13,13,14,14,14,14],
                    'admit_date':['01/01/2011','01/01/2009','12/31/2013','12/31/2017','04/03/2014','08/04/2016',
                                  '03/05/2014','02/07/2011','08/08/2016','12/31/2017','05/01/2011','05/21/2014','07/12/2016']})
# Reshape to long format: columns become person_id, variable, dates
df2 = df2.melt('person_id', value_name='dates')
# Parse the MM/DD/YYYY strings into datetimes
df2['dates'] = pd.to_datetime(df2['dates'], format='%m/%d/%Y')

What I would like to do is:

a) Exclude/filter out records from the data frame if a subject has both Dec 31st and Jan 1st in its records. Please note that the year doesn't matter.

If a subject has either Dec 31st or Jan 1st, we leave them as is.

But if they have both Dec 31st and Jan 1st, we remove one of them (either Dec 31st or Jan 1st). Note that they could have multiple entries with the same date as well, like person_id = 11.

I could only do the below:

# This removes Dec 31st, 2017 even when a subject has only that date, and it matches one year only.
# How can I compare the dates without considering the year?
df2_new = df2['dates'] != '2017-12-31'
df2[df2_new]
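
For reference, one way to make such a comparison ignore the year is to test the month and day separately through the `.dt` accessor. This is only a minimal sketch on a small made-up frame (the name `is_dec31` and the three sample rows are assumptions for illustration), and it does not yet handle the "only when Jan 1st is also present" condition:

```python
import pandas as pd

# Small illustrative frame, not the full data from the question
df = pd.DataFrame({'person_id': [11, 14, 14],
                   'dates': pd.to_datetime(['12/31/2013', '12/31/2017', '05/01/2011'],
                                           format='%m/%d/%Y')})

# Compare month and day only, so every Dec 31st matches regardless of year
is_dec31 = df['dates'].dt.month.eq(12) & df['dates'].dt.day.eq(31)

# Rows that are not Dec 31st in any year
not_dec31 = df[~is_dec31]
```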

My expected output is as shown below:

(image of the expected output)

For person_id = 11, we drop 12-31 because both 12-31 and 01-01 appear in its records, whereas for person_id = 14 we don't drop 12-31 because only 12-31 appears in its records.

We drop 12-31 only when both 12-31 and 01-01 appear in a person's records.

Another way

Coerce the date to day and month. Create a temp column in which 31 Dec is converted to 01 Jan. Then drop duplicates by person id and the temp column, keeping the first.

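import numpy as np

# Keep only day and month, so the year is ignored
df2['dates'] = df2['dates'].dt.strftime('%d %b')
# Map 31 Dec onto 01 Jan in a helper column, keep the first row per person and helper value
df2 = (df2.assign(check=np.where(df2.dates == '31 Dec', '01 Jan', df2.dates))
          .drop_duplicates(['person_id', 'variable', 'check'], keep='first')
          .drop(columns=['check']))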



Output, shown with the temporary check column before it is dropped:

    person_id    variable   dates   check
0          11  admit_date  01 Jan  01 Jan
4          11  admit_date  03 Apr  03 Apr
5          12  admit_date  04 Aug  04 Aug
6          12  admit_date  05 Mar  05 Mar
7          13  admit_date  07 Feb  07 Feb
8          13  admit_date  08 Aug  08 Aug
9          14  admit_date  31 Dec  01 Jan
10         14  admit_date  01 May  01 May
11         14  admit_date  21 May  21 May
12         14  admit_date  12 Jul  12 Jul
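
One thing to keep in mind with this approach: `drop_duplicates` removes every repeated day and month per person, not only the Dec 31st / Jan 1st pair. That is why the output above keeps a single 01 Jan row for person_id 11. The sketch below illustrates the effect on made-up data (person 21 and the name `demo` are hypothetical), where a repeated 5 March is collapsed even though no Jan 1st is involved:

```python
import numpy as np
import pandas as pd

# Hypothetical data: person 21 has 5 March in two different years
demo = pd.DataFrame({'person_id': [21, 21, 21],
                     'dates': pd.to_datetime(['03/05/2014', '03/05/2016', '12/31/2017'],
                                             format='%m/%d/%Y')})

# Same steps as the answer: day-month string, then 31 Dec mapped to 01 Jan
day_month = demo['dates'].dt.strftime('%d %b')
check = np.where(day_month == '31 Dec', '01 Jan', day_month)

# The second 5 March row is dropped as a duplicate
deduped = demo.assign(check=check).drop_duplicates(['person_id', 'check'], keep='first')
```

If repeated dates must be preserved, a boolean mask per person is probably the safer route.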

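Use:

import numpy as np

# Month-day string for each row, so the year is ignored
s = df2['dates'].dt.strftime('%m-%d')
# Per person: does any record fall on Jan 1st / on Dec 31st?
m1 = s.eq('01-01').groupby(df2['person_id']).transform('any')
m2 = s.eq('12-31').groupby(df2['person_id']).transform('any')
# If a person has both, keep only the rows that are not Dec 31st; otherwise keep every row
m3 = np.select([m1 & m2, m1 | m2], [s.ne('12-31'), True], default=True)
df3 = df2[m3]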

Result:

# print(df3)
    person_id    variable      dates
0          11  admit_date 2011-01-01
1          11  admit_date 2009-01-01
4          11  admit_date 2014-04-03
5          12  admit_date 2016-08-04
6          12  admit_date 2014-03-05
7          13  admit_date 2011-02-07
8          13  admit_date 2016-08-08
9          14  admit_date 2017-12-31
10         14  admit_date 2011-05-01
11         14  admit_date 2014-05-21
12         14  admit_date 2016-07-12
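
The same selection can also be written as a single boolean mask without `np.select`, since a row is dropped only when it is a Dec 31st row and its person also has a Jan 1st row. The following is a sketch under that reading of the rule, not the answer's own code; the names `has_jan1` and `drop` are introduced here for illustration:

```python
import pandas as pd

df2 = pd.DataFrame({'person_id': [11, 11, 11, 11, 11, 12, 12, 13, 13, 14, 14, 14, 14],
                    'admit_date': ['01/01/2011', '01/01/2009', '12/31/2013', '12/31/2017',
                                   '04/03/2014', '08/04/2016', '03/05/2014', '02/07/2011',
                                   '08/08/2016', '12/31/2017', '05/01/2011', '05/21/2014',
                                   '07/12/2016']})
df2 = df2.melt('person_id', value_name='dates')
df2['dates'] = pd.to_datetime(df2['dates'], format='%m/%d/%Y')

# Month-day string for each row, so the year is ignored
s = df2['dates'].dt.strftime('%m-%d')

# True on every row of a person who has at least one Jan 1st record
has_jan1 = s.eq('01-01').groupby(df2['person_id']).transform('any')

# Drop a row only if it is Dec 31st and the same person also has Jan 1st
drop = s.eq('12-31') & has_jan1
df3 = df2[~drop]
```

On the sample data this keeps both Jan 1st rows for person_id 11 and the lone Dec 31st row for person_id 14.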
