Given the following dataframe:
import pandas as pd

d = {'year': [2001, 2002, 2005, 2002, 2004, 1999, 1890],
     'tin': [12, 23, 24, 28, 30, 12, 7],
     'ptin': [12, 23, 28, 22, 12, 12, 0]}
df = pd.DataFrame(data=d)
If I run the following code:
df = (df.groupby(['ptin', 'tin', 'year'])
        .apply(lambda x: x['tin'].isin(x['ptin']).astype(int).sum())
        .reset_index(name='matches'))
df
I get the following result:
   ptin  tin  year  matches
0    12  3.0  1999        0
1    12  3.0  2001        0
2    22  1.0  2002        0
3    23  1.0  2002        0
This gives me the tin values that match ptin, grouped by year.
Now I want to find the last occurrence of each tin; for example, for tin == 12 I should get 2001. I want to add that as a column, along with a second column holding the difference between 1999 and 2001, which is two, so that my answer looks like this:
   ptin  tin  year  matches  lastoccurence  length
0    12  3.0  1999        0              0       0
1    12  3.0  2001        0           2001       2
2    22  1.0  2002        0           2002       1
3    23  1.0  2002        0           2002       1
Any help would be appreciated. I can accept a solution in either pandas or SQL, if that is possible.
I think this will do the trick (at least partially):
df['duration'] = df.sort_values(['ptin','year']).groupby('ptin')['year'].diff()
df = df.dropna(subset=['duration'])
print(df)
   ptin  tin  year  matches  duration
2    12   12  2001        1       2.0
3    12   30  2004        0       3.0
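To get closer to the two columns the question asks for, here is a minimal sketch that works on the original dataframe. It rests on two assumptions that the question does not confirm: that "last occurrence" means the latest year in which a given tin appears, and that "length" is the span between the earliest and latest year for that tin. The column names lastoccurence and length are copied from the desired output above.

```python
import pandas as pd

d = {'year': [2001, 2002, 2005, 2002, 2004, 1999, 1890],
     'tin': [12, 23, 24, 28, 30, 12, 7],
     'ptin': [12, 23, 28, 22, 12, 12, 0]}
df = pd.DataFrame(data=d)

# Group the years by tin; transform() broadcasts the per-group result
# back onto every row, so the original row order and index are kept.
years_by_tin = df.groupby('tin')['year']

# Assumption: "last occurrence" = latest year seen for that tin.
df['lastoccurence'] = years_by_tin.transform('max')

# Assumption: "length" = latest year minus earliest year for that tin.
df['length'] = df['lastoccurence'] - years_by_tin.transform('min')

print(df)
```

Under these assumptions, a tin that appears in only one year gets a length of 0, whereas the desired output in the question shows 1 for such rows. If the span should count years inclusively, adding 1 to length would be one way to do it, but that would also turn the 1999 to 2001 case into 3 rather than 2.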