简体   繁体   中英

How to check if dates in one pandas dataframe fall within ranges of dates in another?

I have two pandas dataframes. One is my main file with lots of rows containing "events" such as in the example below, lets call it main_df. While only 2 rows are given, there are around 200.000 in the real file.

event_nr date detail_1 detail_2
e_0001 2010-01-02 07:30:00 1 100
e_0002 2010-06-05 07:30:00 0 132

I then have another df_ranges with date ranges, here also there are a multitude of ranges, albeit not overlapping.

start end name
2010-02-25 2010-03-06 range_1
2010-06-02 2010-06-08 range_2

I now want to check if each date in main_df occurs during any of the time ranges in df_ranges, and if so, create a new column in main_df with the name of the range as a value in the same row.

Desired result:

event_nr date detail_1 detail_2 in_range
e_0001 2010-01-02 07:30:00 1 100 None
e_0002 2010-06-05 07:30:00 0 132 range_2

Most solutions I have encountered only wanted to check if certain dates were between start and end of one single range. I am sure I have to iterate over all ranges and all dates, but don't know how to proceed in practice.

Thank you!

Here is a solution.

Given the following DataFrames:

events = pd.DataFrame(
    {
        "event_nr":["e_0001", "e_0002"],
        "date":["2010-01-02 07:30:00", "2010-06-05 07:30:00"]
    }
)


ranges = pd.DataFrame(
    {
        "start":["2010-02-25", "2010-06-02"],
        "end":["2010-03-06", "2010-06-08"],
        "range_name":["range_1","range_2"] 
    }
)

First, let's make sure all dates are actually datetime objects:

events["date"] = pd.to_datetime(events["date"])
ranges["start"] = pd.to_datetime(ranges["start"])
ranges["end"] = pd.to_datetime(ranges["end"])

Then, let's make a function that implement the desired selection criteria.

def f(row):
    global ranges
    for _i, _row in ranges.iterrows():
        if _row.start <= row.date <= _row.end:
            return _row.range_name
    return None

Finally, let's apply the function to the dataframe to generate the desired new column.

events["in_range"] = events.apply(f, axis=1)

And here is the result:

print(events)

  event_nr                date in_range
0   e_0001 2010-01-02 07:30:00     None
1   e_0002 2010-06-05 07:30:00  range_2

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM