I have some hospital visit healthcare data in a dataframe of the form:
record_id | client_id | date_of_encounter | hospital_id |
---|---|---|---|
1 | MK456 | 2014-01-01 | 01J |
2 | JJ103 | 2016-04-01 | 02J |
3 | MK456 | 2014-02-26 | 01J |
4 | JJ103 | 2016-05-01 | 02H |
5 | MK456 | 2014-03-01 | 02H |
6 | JJ103 | 2016-06-06 | 02J |
I want to create a column hospital_count, which is a cumulative count of the unique hospitals visited by each client as of the date_of_encounter. I have already sorted the data by client_id and date_of_encounter. The desired result would be:
record_id | client_id | date_of_encounter | hospital_id | hospital_count |
---|---|---|---|---|
1 | MK456 | 2014-01-01 | 01J | 1 |
3 | MK456 | 2014-02-26 | 01J | 1 |
5 | MK456 | 2014-03-01 | 02H | 2 |
2 | JJ103 | 2016-04-01 | 02J | 1 |
4 | JJ103 | 2016-05-01 | 02H | 2 |
6 | JJ103 | 2016-06-06 | 02J | 2 |
Some suggest using a combination of groupby and cumsum(), but I am not sure how to apply them here.
Using GroupBy.cumcount
Cumulative count of the number of distinct hospitals visited by each client
import pandas as pd
df = pd.DataFrame({
'record_id': list(range(1,7)),
'client_id':['MK', 'JJ', 'MK', 'JJ', 'MK', 'JJ'],
'date': [20140101, 20160401,20140226,20160501,20140301,20160606],
'hospital': ['1j', '2j', '1j', '2h', '2h', '2j']
})
df.sort_values(by=['client_id', 'date'], inplace=True)
df['hospital_count'] = df.drop_duplicates(subset=['client_id', 'hospital']
).groupby('client_id').cumcount() + 1
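# Forward-fill only the new column; fillna(method='ffill') is deprecated in recent pandas
df['hospital_count'] = df['hospital_count'].ffill()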
print(df)
# record_id client_id date hospital hospital_count
# 1 2 JJ 20160401 2j 1.0
# 3 4 JJ 20160501 2h 2.0
# 5 6 JJ 20160606 2j 2.0
# 0 1 MK 20140101 1j 1.0
# 2 3 MK 20140226 1j 1.0
# 4 5 MK 20140301 2h 2.0
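Explanation: We drop repeat visits of the same client to the same hospital using drop_duplicates, so only the first visit to each hospital remains; then we can simply number the remaining rows of each client using groupby and cumcount. However, this leaves NaN values in the rows that were dropped; we fill those values from the previous row using a forward fill. Because the first row of every client is never a duplicate, the forward fill never carries a value over from one client to the next.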
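Since the question asks about groupby and cumsum, here is a minimal sketch of that variant, using the column names and data from the question. The helper name first_visit is only illustrative. It marks the first visit of each client to each hospital with duplicated, then takes a running sum of those marks per client, which gives integer counts directly with no NaN values to fill.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5, 6],
    "client_id": ["MK456", "JJ103", "MK456", "JJ103", "MK456", "JJ103"],
    "date_of_encounter": pd.to_datetime([
        "2014-01-01", "2016-04-01", "2014-02-26",
        "2016-05-01", "2014-03-01", "2016-06-06",
    ]),
    "hospital_id": ["01J", "02J", "01J", "02H", "02H", "02J"],
})

# Visits must be in chronological order within each client
df = df.sort_values(["client_id", "date_of_encounter"])

# True on the first visit of a client to a given hospital, False on repeats
first_visit = ~df.duplicated(subset=["client_id", "hospital_id"])

# Running total of first visits per client = distinct hospitals seen so far
df["hospital_count"] = first_visit.astype(int).groupby(df["client_id"]).cumsum()

print(df)
```

This assumes the same sort order as above; if the rows are not sorted by date within each client, the counts will not reflect the order of the encounters.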
Cumulative count of the number of visits of each client to each hospital
import pandas as pd
df = pd.DataFrame({
'record_id': list(range(1,7)),
'client_id':['MK', 'JJ', 'MK', 'JJ', 'MK', 'JJ'],
'date': [20140101, 20160401,20140226,20160501,20140301,20160606],
'hospital': ['1j', '2j', '1j', '2h', '2h', '2j']
})
df['hospital_count'] = df.sort_values(by=['client_id', 'hospital', 'date']
).groupby(['client_id', 'hospital']
).cumcount() + 1
print(df)
# record_id client_id date hospital hospital_count
# 0 1 MK 20140101 1j 1
# 1 2 JJ 20160401 2j 1
# 2 3 MK 20140226 1j 2
# 3 4 JJ 20160501 2h 1
# 4 5 MK 20140301 2h 1
# 5 6 JJ 20160606 2j 2