
Remove records from dataframe based on date condition

I have a dataframe with a column named date, which contains the following dates:

In [67]: df.date.drop_duplicates()
Out[67]: 
0      2020-02-04
570    2020-02-19
1157   2020-03-03
1791   2020-04-02
2452   2020-04-08
3113   2020-05-05
3777   2020-06-03
4445   2020-07-02
5131   2020-08-04
Name: date, dtype: datetime64[ns]

I only want monthly data, and within each month I want to keep only the earliest date. So here I want to delete all records where the date is 2020-02-19 or 2020-04-08. The problem is that I never know in advance which dates I will receive. I could also have received 2020-07-22, in which case I would want to delete all records with date 2020-07-22 too, since I already have 2020-07-02.

Do you know a smooth way to do that? I thought of sorting the values so that they look like this:

2020-02-04
2020-03-03
2020-04-02
2020-05-05
2020-06-03
2020-07-02
2020-08-04
2020-02-19
2020-04-08

Then I could delete all records whose date is one of the dates after the 7th row (counted from 1), since I always have a variable that determines how many data points I need. But I couldn't figure out how to sort it like that. Do you know any other way, or could you help me sort the date values? Thank you so much!

IIUC, you can group by month and then take the min:

df.groupby(df.date.dt.month).min()

If 'date' spans more than one year, group by year and month:

df.groupby([df.date.dt.month,df.date.dt.year]).min()

Output:

           date
           
2    2020-02-04
3    2020-03-03
4    2020-04-02
5    2020-05-05
6    2020-06-03
7    2020-07-02
8    2020-08-04
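
The groupby above returns one row per month rather than the filtered records. To drop the unwanted rows from the original frame, one option is to broadcast each month's minimum date back onto every row and keep only the rows that match it. This is a minimal sketch on toy data; the `value` column and the names `first_of_month` and `monthly` are made up for illustration and are not part of the question.

```python
import pandas as pd

# Toy frame standing in for the asker's data; `value` is a hypothetical column.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-02-04", "2020-02-04", "2020-02-19",
        "2020-03-03", "2020-04-02", "2020-04-08", "2020-04-08",
    ]),
    "value": range(7),
})

# Group keys: year and month of each row (renamed to avoid clashing with "date").
year = df["date"].dt.year.rename("year")
month = df["date"].dt.month.rename("month")

# Earliest date within each (year, month), repeated for every row of that group.
first_of_month = df.groupby([year, month])["date"].transform("min")

# Keep only the records whose date is the earliest date of its month.
monthly = df[df["date"] == first_of_month]
print(monthly)
```

Because `transform` returns a Series aligned with the original index, all records sharing the earliest date of a month are kept, and later dates in the same month are removed.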

This works even if your data spans more than a year:

df.sort_values(by='date').groupby(df.date.dt.year.astype(str)
                                  + df.date.dt.month.astype(str)).first()

Output:

0       idx       date
date                  
20202     0 2020-02-04
20203  1157 2020-03-03
20204  1791 2020-04-02
20205  3113 2020-05-05
20206  3777 2020-06-03
20207  4445 2020-07-02
20208  5131 2020-08-04
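
One caveat with the concatenated string key is that it sorts lexicographically, so "202010" comes before "20202". A possible alternative, sketched here on toy data, is a monthly period key from `dt.to_period("M")`, which stays in chronological order across years. The names `month_key`, `first_dates` and `filtered` are illustrative assumptions, not from the answers above.

```python
import pandas as pd

# Toy data spanning two years; the real frame would have more columns.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-11-20", "2020-11-05", "2021-01-07", "2021-01-03", "2021-11-09",
    ]),
})

# One key per calendar month, e.g. Period('2020-11', 'M').
month_key = df["date"].dt.to_period("M")

# Earliest date seen in each month, indexed by the period.
first_dates = df.groupby(month_key)["date"].min()

# Delete every record whose date is not one of those monthly firsts.
filtered = df[df["date"].isin(first_dates)]
print(first_dates)
print(filtered)
```

This relies on the same idea as the answers (a per-month minimum) and then filters the original rows with `isin`, so no manual sorting of the dates is needed.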


 