Find the closest date from a list to a given date that is not after the given date

I have one dataframe of weekly training sessions and another of evaluations submitted by attendees at those sessions.

Each dataframe has a date column. For sessions, it is the date the session occurred. For evaluations, it is the date the evaluation was submitted. Attendees can be expected to attend multiple sessions and will therefore have submitted multiple evaluations.

I need to tie each evaluation back to a specific session. An attendee may have submitted an evaluation on the same day as a session, in which case the match is easy. But they can submit an evaluation on any day up to the next training session.

For each date in the evaluation df, I need to return the session date that is closest to the evaluation date but not after the evaluation date.

Example session dates: 2/3/22, 2/10/22, 2/17/22

Example evaluation dates with desired output: 2/3/22 (should match 2/3/22), 2/4/22 (should match 2/3/22), 2/11/22 (should match 2/10/22)

Here's a way to do it.

In the sessions dataframe, set the date column to be the index:

sessions = sessions.set_index('date')

Sort sessions by index (that is, by date):

sessions = sessions.sort_index()

Add a session_evaluated column to evaluations that holds the date of the session each evaluation applies to. First call sessions.index.get_indexer() on the date column of evaluations with the method argument set to 'pad', which "rounds down" non-matching dates to the previous session and returns integer positions into the sessions index (or -1 when an evaluation predates every session). Then look those positions up in the sessions index, which contains the session dates, leaving the -1 cases as NaT:

# Position of the latest session on or before each evaluation date (-1 if none)
pos = sessions.index.get_indexer(evaluations['date'], method='pad')
# Blank out the -1 positions so they do not wrap around to the last session
evaluations['session_evaluated'] = sessions.index[pos].where(pos >= 0)
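
To see what method='pad' does on its own, here is a minimal sketch using the session dates from the question. The names session_dates and eval_dates are made up for illustration, and the first evaluation date (before any session) is an added edge case, not part of the question.

```python
import pandas as pd

# Session dates from the question, as a sorted DatetimeIndex
session_dates = pd.DatetimeIndex(['2022-02-03', '2022-02-10', '2022-02-17'])

# One date before any session (added edge case), then the three from the question
eval_dates = pd.to_datetime(['2022-02-01', '2022-02-03', '2022-02-04', '2022-02-11'])

# method='pad' gives the position of the last session on or before each date,
# or -1 when no such session exists
positions = session_dates.get_indexer(eval_dates, method='pad')
print(positions.tolist())  # [-1, 0, 0, 1]

# Look up the dates, replacing the -1 positions with NaT instead of letting
# them wrap around to the last session
matched = session_dates[positions].where(positions >= 0)
print(matched)
```

A position of -1 used directly as an index would silently select the last session, so it is worth masking those entries whenever evaluations might predate the first session.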

Here's what it looks like all put together with sample inputs:

import pandas as pd
sessions = pd.DataFrame({
    'date' : ['2022-02-01', '2022-03-01', '2022-04-01', '2022-05-01', '2022-01-01'],
    'topic' : ['Easy 1', 'Easy 2', 'Intermediate', 'Advanced', 'Intro']
})
evaluations = pd.DataFrame({
    'date' : [
        '2022-01-05', '2022-01-10', '2022-01-15', '2022-01-20', '2022-01-25', 
        '2022-02-01', '2022-02-05', '2022-02-28',
        '2022-03-01', '2022-03-15', '2022-03-31',
        '2022-04-01', '2022-04-15'
    ],
    'rating' : [9,8,7,8,9,5,4,3,10,10,10,2,4]
})
sessions['date'] = pd.to_datetime(sessions['date'])
evaluations['date'] = pd.to_datetime(evaluations['date'])
sessions = sessions.set_index('date')
sessions = sessions.sort_index()
print(sessions)
print(evaluations)
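# Position of the latest session on or before each evaluation date (-1 if none)
pos = sessions.index.get_indexer(evaluations['date'], method='pad')
# Blank out the -1 positions so they do not wrap around to the last session
evaluations['session_evaluated'] = sessions.index[pos].where(pos >= 0)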
print(evaluations)

Results:

                   topic
date
2022-01-01         Intro
2022-02-01        Easy 1
2022-03-01        Easy 2
2022-04-01  Intermediate
2022-05-01      Advanced
         date  rating
0  2022-01-05       9
1  2022-01-10       8
2  2022-01-15       7
3  2022-01-20       8
4  2022-01-25       9
5  2022-02-01       5
6  2022-02-05       4
7  2022-02-28       3
8  2022-03-01      10
9  2022-03-15      10
10 2022-03-31      10
11 2022-04-01       2
12 2022-04-15       4
         date  rating session_evaluated
0  2022-01-05       9        2022-01-01
1  2022-01-10       8        2022-01-01
2  2022-01-15       7        2022-01-01
3  2022-01-20       8        2022-01-01
4  2022-01-25       9        2022-01-01
5  2022-02-01       5        2022-02-01
6  2022-02-05       4        2022-02-01
7  2022-02-28       3        2022-02-01
8  2022-03-01      10        2022-03-01
9  2022-03-15      10        2022-03-01
10 2022-03-31      10        2022-03-01
11 2022-04-01       2        2022-04-01
12 2022-04-15       4        2022-04-01

UPDATED:

Here's another way to do it using the merge_asof() function, starting again from the original sessions dataframe in which date is still a regular column. It doesn't require the date column to be the index, though it does require that both dataframe arguments be sorted by date:

sessions['date'] = pd.to_datetime(sessions['date'])
evaluations['date'] = pd.to_datetime(evaluations['date'])
evaluations = pd.merge_asof(
    evaluations.sort_values(by=['date']), 
    # Keep only the date column and copy it, so the session date survives the merge
    sessions[['date']].sort_values(by=['date']).assign(session_evaluated=lambda s: s['date']),
    on='date')
print(evaluations)

Output:

         date  rating session_evaluated
0  2022-01-05       9        2022-01-01
1  2022-01-10       8        2022-01-01
2  2022-01-15       7        2022-01-01
3  2022-01-20       8        2022-01-01
4  2022-01-25       9        2022-01-01
5  2022-02-01       5        2022-02-01
6  2022-02-05       4        2022-02-01
7  2022-02-28       3        2022-02-01
8  2022-03-01      10        2022-03-01
9  2022-03-15      10        2022-03-01
10 2022-03-31      10        2022-03-01
11 2022-04-01       2        2022-04-01
12 2022-04-15       4        2022-04-01
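
The call above relies on the defaults of merge_asof(). As a rough sketch, the same match can be written with those defaults spelled out, using the session dates from the question. The ratings and the first evaluation date (before any session) are made up for illustration.

```python
import pandas as pd

sessions = pd.DataFrame({
    'date': pd.to_datetime(['2022-02-03', '2022-02-10', '2022-02-17'])
})
evaluations = pd.DataFrame({
    'date': pd.to_datetime(['2022-02-01', '2022-02-03', '2022-02-04', '2022-02-11']),
    'rating': [6, 9, 8, 7],  # made-up ratings
})

# Copy the session date into its own column so it survives the merge
right = sessions.sort_values('date').assign(session_evaluated=lambda s: s['date'])

matched = pd.merge_asof(
    evaluations.sort_values('date'),
    right,
    on='date',
    direction='backward',      # the default: last session on or before the evaluation
    allow_exact_matches=True,  # also the default: a same-day session counts
)
print(matched)
```

The evaluation dated before the first session gets NaT in session_evaluated, so unmatched rows are easy to find with isna().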

UPDATE #2: The call to assign() in the above code can also be written using **kwargs syntax, in case we want a column name that contains spaces or is otherwise not a valid Python identifier (instead of session_evaluated). For example:

evaluations = pd.merge_asof(
    evaluations.sort_values(by=['date']), 
    sessions.sort_values(by=['date'])['date'].to_frame()
        .assign(**{'Evaluated Session (Date)' : lambda s: s['date']}), 
    on='date')

Output:

         date  rating Evaluated Session (Date)
0  2022-01-05       9               2022-01-01
1  2022-01-10       8               2022-01-01
2  2022-01-15       7               2022-01-01
3  2022-01-20       8               2022-01-01
4  2022-01-25       9               2022-01-01
5  2022-02-01       5               2022-02-01
6  2022-02-05       4               2022-02-01
7  2022-02-28       3               2022-02-01
8  2022-03-01      10               2022-03-01
9  2022-03-15      10               2022-03-01
10 2022-03-31      10               2022-03-01
11 2022-04-01       2               2022-04-01
12 2022-04-15       4               2022-04-01
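
As an alternative sketch (not part of the original answer), the column can keep a valid identifier inside assign() and be renamed after the merge. The two-row inputs below are a trimmed subset of the sample data above.

```python
import pandas as pd

sessions = pd.DataFrame({'date': pd.to_datetime(['2022-01-01', '2022-02-01'])})
evaluations = pd.DataFrame({
    'date': pd.to_datetime(['2022-01-05', '2022-02-05']),
    'rating': [9, 4],
})

matched = pd.merge_asof(
    evaluations.sort_values('date'),
    sessions.sort_values('date').assign(session_evaluated=lambda s: s['date']),
    on='date',
).rename(columns={'session_evaluated': 'Evaluated Session (Date)'})  # rename after merging

print(matched.columns.tolist())  # ['date', 'rating', 'Evaluated Session (Date)']
```

Whether this reads better than the **kwargs form is a matter of taste; the resulting dataframe should be the same.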
