
How to generate cumulative unique count at a group level in python?

I have some hospital visit healthcare data in a dataframe of the form:

record_id  client_id  date_of_encounter  hospital_id
1          MK456      2014-01-01         01J
2          JJ103      2016-04-01         02J
3          MK456      2014-02-26         01J
4          JJ103      2016-05-01         02H
5          MK456      2014-03-01         02H
6          JJ103      2016-06-06         02J

I want to create a column hospital_count that holds, for each row, the cumulative count of UNIQUE hospitals the client has visited up to and including that date_of_encounter. I have already sorted the data by client_id and date_of_encounter. The resulting transformation would be:

record_id  client_id  date_of_encounter  hospital_id  hospital_count
1          MK456      2014-01-01         01J          1
3          MK456      2014-02-26         01J          1
5          MK456      2014-03-01         02H          2
2          JJ103      2016-04-01         02J          1
4          JJ103      2016-05-01         02H          2
6          JJ103      2016-06-06         02J          2

Some suggest combining groupby with cumsum(), but I am not sure how to do that.

Using GroupBy.cumcount

Cumulative count of the number of distinct hospitals visited by each client

import pandas as pd

df = pd.DataFrame({
  'record_id': list(range(1,7)),
  'client_id':['MK', 'JJ', 'MK', 'JJ', 'MK', 'JJ'],
  'date': [20140101, 20160401,20140226,20160501,20140301,20160606],
  'hospital': ['1j', '2j', '1j', '2h', '2h', '2j']
})

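# Sort so that each client's visits are in chronological order.
df.sort_values(by=['client_id', 'date'], inplace=True)

# Keep only the first visit of each client to each hospital and number those
# rows 1, 2, 3, ... within the client; repeat visits are left as NaN.
df['hospital_count'] = df.drop_duplicates(subset=['client_id', 'hospital']
  ).groupby('client_id').cumcount() + 1

# Carry the last count forward over the repeat visits.
# (fillna(method='ffill') is deprecated in recent pandas; ffill() is equivalent.)
df['hospital_count'] = df['hospital_count'].ffill()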

print(df)
#    record_id client_id      date hospital  hospital_count
# 1          2        JJ  20160401       2j             1.0
# 3          4        JJ  20160501       2h             2.0
# 5          6        JJ  20160606       2j             2.0
# 0          1        MK  20140101       1j             1.0
# 2          3        MK  20140226       1j             1.0
# 4          5        MK  20140301       2h             2.0

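Explanation: We drop every repeat visit of a client to a hospital they have already visited using drop_duplicates, which keeps only the first visit to each hospital; then we number the remaining rows within each client using groupby and cumcount. Because the assignment aligns on the index, the rows that were dropped get NaN; we forward-fill those values (ffill). This is safe only because the frame is sorted by client_id and date: the first row of each client is always a first visit, so a count is never carried over from one client to the next.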
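The question mentions combining groupby with cumsum(). A minimal sketch of that variant, assuming the same toy frame as above: mark the first visit of each client to each hospital with duplicated, then take a running sum of those marks per client. The name is_first_visit is only illustrative. This avoids the NaN values and the forward fill, and keeps the counts as integers.

```python
import pandas as pd

df = pd.DataFrame({
    'record_id': list(range(1, 7)),
    'client_id': ['MK', 'JJ', 'MK', 'JJ', 'MK', 'JJ'],
    'date': [20140101, 20160401, 20140226, 20160501, 20140301, 20160606],
    'hospital': ['1j', '2j', '1j', '2h', '2h', '2j'],
})

# Chronological order within each client, as in the answer above.
df = df.sort_values(by=['client_id', 'date'])

# True on the first visit of a client to a given hospital, False on repeats.
is_first_visit = ~df.duplicated(subset=['client_id', 'hospital'])

# Running total of first visits within each client.
df['hospital_count'] = (
    is_first_visit.astype(int).groupby(df['client_id']).cumsum()
)

print(df)
```

With this data the counts should match the forward-fill version row for row, only as integers rather than floats.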

Cumulative count of the number of visits of each client to each hospital

import pandas as pd

df = pd.DataFrame({
  'record_id': list(range(1,7)),
  'client_id':['MK', 'JJ', 'MK', 'JJ', 'MK', 'JJ'],
  'date': [20140101, 20160401,20140226,20160501,20140301,20160606],
  'hospital': ['1j', '2j', '1j', '2h', '2h', '2j']
})

df['hospital_count'] = df.sort_values(by=['client_id', 'hospital', 'date']
  ).groupby(['client_id', 'hospital']
  ).cumcount() + 1

print(df)
#    record_id client_id      date hospital  hospital_count
# 0          1        MK  20140101       1j               1
# 1          2        JJ  20160401       2j               1
# 2          3        MK  20140226       1j               2
# 3          4        JJ  20160501       2h               1
# 4          5        MK  20140301       2h               1
# 5          6        JJ  20160606       2j               2
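Note that the snippet above computes the counts on a sorted copy but assigns them to the unsorted frame. This works because pandas aligns the assigned Series on the index rather than on row position. A small sketch of that behaviour, using a made-up frame whose rows are deliberately out of date order (the column name visit_number is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'client_id': ['MK', 'JJ', 'MK', 'JJ'],
    'hospital': ['1j', '2j', '1j', '2j'],
    'date': [20140226, 20160606, 20140101, 20160401],
})

# Number the visits in date order within each (client, hospital) pair.
counts = (
    df.sort_values(by=['client_id', 'hospital', 'date'])
      .groupby(['client_id', 'hospital'])
      .cumcount() + 1
)

# counts still carries the original row labels, so the assignment puts each
# value back on the row it was computed for, whatever the current row order.
df['visit_number'] = counts

print(df)
```

Here the later visit of each client ends up with visit_number 2 even though it appears first in the frame.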

