
Grouping by count and year, and displaying the last occurrence and its count

Given the following DataFrame:

import pandas as pd

d = {'year': [2001, 2002, 2005, 2002, 2004, 1999, 1890],
     'tin': [12, 23, 24, 28, 30, 12, 7],
     'ptin': [12, 23, 28, 22, 12, 12, 0]}

df = pd.DataFrame(data=d)

If I run the following code:

df = (df.groupby(['ptin', 'tin', 'year'])
                  .apply(lambda x : x['tin'].isin(x['ptin']).astype(int).sum())
                  .reset_index(name='matches'))
df

I get the following result:

    ptin    tin   year   matches
0   12      3.0   1999   0
1   12      3.0   2001   0
2   22      1.0   2002   0
3   23      1.0   2002   0

This gives me the tin-to-ptin matches, grouped by year.

Now I want to find the last occurrence of a given tin. For example, for tin == 12 I should get 2001. I want to add that as a column, plus a separate column holding the difference between 1999 and 2001 (which is 2), so that my result looks like this:

    ptin    tin   year   matches    lastoccurence   length 
0   12      3.0   1999   0            0               0
1   12      3.0   2001   0            2001            2
2   22      1.0   2002   0            2002            1
3   23      1.0   2002   0            2002            1

Any help would be appreciated. A solution in either pandas or SQL would work for me.

I think this will do the trick (at least partially):

df['duration'] = df.sort_values(['ptin','year']).groupby('ptin')['year'].diff()
df = df.dropna(subset=['duration'])
print(df)

     ptin  tin  year  matches  duration
2    12    12  2001        1       2.0
3    12    30  2004        0       3.0
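
To also get the last occurrence and the span in years for each tin, one possible sketch is to group by tin and broadcast the per-group maximum and minimum year back onto every row with transform. This assumes "last occurrence" means the latest year in which a tin value appears and "length" means the gap between its earliest and latest year; the column names are taken from the question, and the code runs on the original DataFrame rather than the grouped result:

```python
import pandas as pd

d = {'year': [2001, 2002, 2005, 2002, 2004, 1999, 1890],
     'tin': [12, 23, 24, 28, 30, 12, 7],
     'ptin': [12, 23, 28, 22, 12, 12, 0]}
df = pd.DataFrame(data=d)

# Group the years by tin; transform returns one value per original row,
# so the results can be assigned straight back as new columns.
years_by_tin = df.groupby('tin')['year']

# Latest year in which each tin appears (assumed meaning of "last occurrence").
df['lastoccurence'] = years_by_tin.transform('max')

# Gap between the earliest and latest year for each tin (assumed meaning of "length").
df['length'] = df['lastoccurence'] - years_by_tin.transform('min')

print(df.sort_values(['tin', 'year']))
```

For tin == 12 this gives 2001 as the last occurrence and a length of 2, matching the example in the question. A tin that appears only once gets a length of 0, which differs from the 1 shown in the desired output, so adjust that if a single occurrence should count as one year.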
