简体   繁体   中英

How to select rows within two dates in a pandas dataframe?

I have a pandas dataframe such as:

ids    year    month    visit_counts
112    2012      1           5
112    2012      5           25
414    2012      2           10
112    2013      5           10
414    2012      4           35
112    2013      8           16
112    2013     12           23
112    2014      2           20
112    2014      4           28
414    2013      6           15
414    2013     11           30
414    2014      2           24
414    2014      7           17

Each row represent a record with a date of occurrence. lets say, the first date of occurrence associated with id 112 is (year 2012, and month january (1)), and this first date of occurrence for different ids are different. I need to create new dataframe where for every id I want to have all the records within two years starting since the first date of occurrence.

So, the output dataframe should look like this:

ids    year    month    visit_counts
112    2012      1           5
112    2012      5           25
414    2012      2           10
112    2013      5           10
414    2012      4           35
112    2013      8           16
112    2013     12           23
414    2013      6           15
414    2013     11           30
414    2014      2           24

It would be helpful, if anyone can provide the codes. Thank you!

First we create a helper column time_in_months to be able to compare the dates. Then use .groupby and .transform to find the dates which are at maximum 24 months apart from the minimum date and filter for these.

df['time_in_months'] = df['year'] * 12 + df['month']
df[df.groupby('ids').time_in_months.transform(lambda x: x -x.min() <= 24  )].drop('time_in_months', 1)

result

        ids     year    month visit_counts
0       112     2012    1     5
1       112     2012    5     25
2       414     2012    2     10
3       112     2013    5     10
4       414     2012    4     35
5       112     2013    8     16
6       112     2013    12    23
9       414     2013    6     15
10      414     2013    11    30
11      414     2014    2     24

Work with datetime64 . We create a date, use groupby + transform to obtain the first date for each ID, then we can keep only rows within 2 years with a simple Boolean mask.

I've left the extra columns in for illustration, but drop them if you'd like.

import pandas as pd

df['date'] = pd.to_datetime(df.assign(day=1)[['year', 'month', 'day']])
df['first_d'] = df.groupby('ids').date.transform('min')
m = df.date <= df.first_d + pd.offsets.DateOffset(years=2)

df.loc[m]

    ids  year  month  visit_counts       date    first_d
0   112  2012      1             5 2012-01-01 2012-01-01
1   112  2012      5            25 2012-05-01 2012-01-01
2   414  2012      2            10 2012-02-01 2012-02-01
3   112  2013      5            10 2013-05-01 2012-01-01
4   414  2012      4            35 2012-04-01 2012-02-01
5   112  2013      8            16 2013-08-01 2012-01-01
6   112  2013     12            23 2013-12-01 2012-01-01
9   414  2013      6            15 2013-06-01 2012-02-01
10  414  2013     11            30 2013-11-01 2012-02-01
11  414  2014      2            24 2014-02-01 2012-02-01

Here is one way

s=df.eval('year*12+month')
df[s<s.groupby(df['ids']).transform('min')+24]
    ids  year  month  visit_counts
0   112  2012      1             5
1   112  2012      5            25
2   414  2012      2            10
3   112  2013      5            10
4   414  2012      4            35
5   112  2013      8            16
6   112  2013     12            23
9   414  2013      6            15
10  414  2013     11            30

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM