I have a dataset in the following format:
Patient Date colA colB
1 1/3/2015 . 5
1 2/5/2015 3 10
1 3/5/2016 8 .
2 4/5/2014 2 .
2 etc
I am trying to define a function in PANDAS which treats unique patients as an item and iterates over these unique patient items to keep only to most recent observation per column (replacing all other values with missing or null). For example: for patient 1, the output would entail -
Patient Date colA colB
1 1/3/2015 . .
1 2/5/2015 . 10
1 3/5/2016 8 .
I understand that I can use something like the following with .apply(), but this does not account for duplicate patient IDs...
def getrecentobs():
for i in df['Patient']:
etc
Any help or direction is much appreciated.
I think you can use to_numeric
for convert values .
to NaN
, then create mask
with groupby
and rank
and last apply mask
:
print df
Patient Date colA colB
0 1 1/3/2015 . 5
1 1 2/5/2015 3 10
2 1 3/5/2016 8 .
3 2 4/5/2014 2 .
4 2 5/5/2014 4 .
df['colA'] = pd.to_numeric(df['colA'], errors='coerce')
df['colB'] = pd.to_numeric(df['colB'], errors='coerce')
print df
Patient Date colA colB
0 1 1/3/2015 NaN 5
1 1 2/5/2015 3 10
2 1 3/5/2016 8 NaN
3 2 4/5/2014 2 NaN
4 2 5/5/2014 4 NaN
print df.groupby('Patient')[['colA','colB']].rank(method='max', ascending=False)
colA colB
0 NaN 2
1 2 1
2 1 NaN
3 2 NaN
4 1 NaN
mask = df.groupby('Patient')[['colA','colB']].rank(method='max', ascending=False) == 1
print mask
colA colB
0 False False
1 False True
2 True False
3 False False
4 True False
df[['colA','colB']] = df[['colA','colB']][mask]
print df
Patient Date colA colB
0 1 1/3/2015 NaN NaN
1 1 2/5/2015 NaN 10
2 1 3/5/2016 8 NaN
3 2 4/5/2014 NaN NaN
4 2 5/5/2014 4 NaN
There is a function in pandas called last which can be used with groupby to give you the last values for a given groupby. I'm not sure why you require the blank rows but if you need them you can join the groupby back on the original data frame. Sorry the sort is there as the date was not sorted in my sample data. Hope that helps.
Example:
DataFrame
id date amount code
0 3107 2010-10-20 136.4004 290
1 3001 2010-10-08 104.1800 290
2 3109 2010-10-08 276.0629 165
3 3001 2010-10-08 -177.9800 290
4 3002 2010-10-08 1871.1094 290
5 3109 2010-10-08 225.7038 155
6 3109 2010-10-08 98.5578 170
7 3107 2010-10-08 231.3949 165
8 3203 2010-10-08 333.6636 290
9 -9100 2010-10-08 3478.7500 290
If previous rows not needed:
b.sort_values("date").groupby(["id","date"]).last().reset_index()
The groupby aggregates the data by the "last" meaning the last value for those columns.
Output only latest rows with values:
id date amount code
0 -9100 2010-10-08 3478.7500 290
1 3001 2010-10-08 -177.9800 290
2 3002 2010-10-08 1871.1094 290
3 3107 2010-10-08 231.3949 165
4 3107 2010-10-20 136.4004 290
5 3109 2010-10-08 98.5578 170
6 3203 2010-10-08 333.6636 290
I think you are looking for pandas
groupby .
For example, df.groubpy('Patient').last()
will return a DataFrame with the last observation of each patient . If the patients are not sorted by date
you can find the latest record date using max
function.
df.groupby('Patient').last()
Date colA colB
Patient
1 3/5/2016 8 .
2 etc 2 .
You can make your own functions and then call the apply()
function of groupby .
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.