Remove records from dataframe based on date condition
I have a dataframe with a column named date, which contains the following dates:
In [67]: df.date.drop_duplicates()
Out[67]:
0 2020-02-04
570 2020-02-19
1157 2020-03-03
1791 2020-04-02
2452 2020-04-08
3113 2020-05-05
3777 2020-06-03
4445 2020-07-02
5131 2020-08-04
Name: date, dtype: datetime64[ns]
I only want to have monthly data, and for each month I want to keep only the earliest date. So here I want to delete all records where the date is 2020-02-19 or 2020-04-08. The problem is that I never know which dates I will receive. I could also have received 2020-07-22, in which case I would want to delete all records with date 2020-07-22 too, as I already have 2020-07-02.
Do you know a smooth way to do that? I thought of sorting the values so that they look like this:
2020-02-04
2020-03-03
2020-04-02
2020-05-05
2020-06-03
2020-07-02
2020-08-04
2020-02-19
2020-04-08
Then I could delete all records where the date is one of the dates after the 7th row (counted from 1), as I always have a variable that determines how many datapoints I need. But I couldn't figure out how to sort it like that. Do you know any other way, or could you help me sort the date values? Thank you so much!
IIUC, you can group by month and then take the min:
df.groupby(df.date.dt.month).min()
If 'date' spans more than one year, group by year and month:
df.groupby([df.date.dt.month,df.date.dt.year]).min()
Output:
date
2 2020-02-04
3 2020-03-03
4 2020-04-02
5 2020-05-05
6 2020-06-03
7 2020-07-02
8 2020-08-04
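The grouped result lists the earliest date per month, but the question asks to drop the other records from the original frame. One possible way is to broadcast the per-month minimum back onto every row with transform and keep only the matching rows. This is a minimal sketch, not necessarily what the answerer had in mind; the sample rows, the `value` column, and the names `first_in_month` and `monthly` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: several records per date, as in the question.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-02-04", "2020-02-04", "2020-02-19",
        "2020-03-03", "2020-04-02", "2020-04-08", "2020-04-08",
    ]),
    "value": range(7),  # placeholder payload column (assumption)
})

# Year-month key, so the same month in different years stays separate.
month_key = df["date"].dt.to_period("M").rename("month")

# Earliest date within each year-month, repeated for every row of that month.
first_in_month = df.groupby(month_key)["date"].transform("min")

# Keep every record that carries the earliest date of its month.
monthly = df[df["date"] == first_in_month]
print(monthly)
```

Unlike min() on the grouped frame, this keeps all records sharing the earliest date, and it does not depend on knowing in advance which dates arrive.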
This works even if your data spans more than a year:
df.sort_values(by='date').groupby(df.date.dt.year.astype(str)
+ df.date.dt.month.astype(str)).first()
Output:
0 idx date
date
20202 0 2020-02-04
20203 1157 2020-03-03
20204 1791 2020-04-02
20205 3113 2020-05-05
20206 3777 2020-06-03
20207 4445 2020-07-02
20208 5131 2020-08-04
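One thing to watch with a concatenated string key: it sorts lexicographically, so "202010" (October) would come before "20202" (February) once the data includes two-digit months. A possible variation, sketched here with made-up data, uses a monthly period as the key so the groups come out in chronological order. The sample rows, the `value` column, and the name `first_rows` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with single- and double-digit months across two years.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-02-04", "2020-02-19", "2020-10-05", "2020-10-20", "2021-02-03",
    ]),
    "value": [1, 2, 3, 4, 5],  # placeholder payload column (assumption)
})

# Monthly period key (2020-02, 2020-10, 2021-02) sorts chronologically.
month_key = df["date"].dt.to_period("M").rename("month")

# After sorting by date, first() picks the earliest row of each month.
first_rows = df.sort_values(by="date").groupby(month_key).first()
print(first_rows)
```

As with the string key, first() returns a single row per month, so it suits the case where one representative record per month is enough.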