How to generate a cumulative unique count at a group level in Python?
I have some hospital visit healthcare data in a dataframe of the form:
record_id | client_id | date_of_encounter | hospital_id |
---|---|---|---|
1 | MK456 | 2014-01-01 | 01J |
2 | JJ103 | 2016-04-01 | 02J |
3 | MK456 | 2014-02-26 | 01J |
4 | JJ103 | 2016-05-01 | 02H |
5 | MK456 | 2014-03-01 | 02H |
6 | JJ103 | 2016-06-06 | 02J |
I want to create a column `hospital_count` which is a cumulative count of the unique hospitals visited by each client as of the `date_of_encounter`. I have already sorted the data by `client_id` and `date_of_encounter`. The resulting transformation would be:
record_id | client_id | date_of_encounter | hospital_id | hospital_count |
---|---|---|---|---|
1 | MK456 | 2014-01-01 | 01J | 1 |
3 | MK456 | 2014-02-26 | 01J | 1 |
5 | MK456 | 2014-03-01 | 02H | 2 |
2 | JJ103 | 2016-04-01 | 02J | 1 |
4 | JJ103 | 2016-05-01 | 02H | 2 |
6 | JJ103 | 2016-06-06 | 02J | 2 |
Some suggest using a combination of `groupby` and `cumsum()`, but I am not sure how to do that.
Using GroupBy.cumcount
Cumulative count of the number of distinct hospitals visited by each client
import pandas as pd

df = pd.DataFrame({
    'record_id': list(range(1, 7)),
    'client_id': ['MK', 'JJ', 'MK', 'JJ', 'MK', 'JJ'],
    'date': [20140101, 20160401, 20140226, 20160501, 20140301, 20160606],
    'hospital': ['1j', '2j', '1j', '2h', '2h', '2j']
})

# Put each client's visits in chronological order.
df.sort_values(by=['client_id', 'date'], inplace=True)

# Keep only the first visit of each client to each hospital and number those rows.
# Rows dropped here get NaN when the result is aligned back on the index.
df['hospital_count'] = df.drop_duplicates(subset=['client_id', 'hospital']
    ).groupby('client_id').cumcount() + 1

# Forward-fill the NaN rows with the preceding count and restore the integer dtype.
df['hospital_count'] = df['hospital_count'].ffill().astype(int)
print(df)
#    record_id client_id      date hospital  hospital_count
# 1          2        JJ  20160401       2j               1
# 3          4        JJ  20160501       2h               2
# 5          6        JJ  20160606       2j               2
# 0          1        MK  20140101       1j               1
# 2          3        MK  20140226       1j               1
# 4          5        MK  20140301       2h               2

Explanation: `drop_duplicates` drops every repeat visit of a client to a hospital they have already visited, so `groupby` and `cumcount` can simply number the remaining first visits of each client. Assigning the result back leaves `NaN` in the rows that were dropped. `ffill` fills each of those with the preceding count, which always belongs to the same client because a client's first row is never dropped, and `astype(int)` converts the column back to integers.
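The question asks about combining `groupby` with `cumsum()`. A minimal sketch of that variant follows, assuming the question's original column names and sample rows. It is an alternative to the `cumcount` approach above, not part of the original answer: each row is flagged as a first visit or a repeat with `duplicated`, and the flags are summed cumulatively within each client.

```python
import pandas as pd

# Sample rows from the question, with its original column names.
df = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5, 6],
    "client_id": ["MK456", "JJ103", "MK456", "JJ103", "MK456", "JJ103"],
    "date_of_encounter": pd.to_datetime([
        "2014-01-01", "2016-04-01", "2014-02-26",
        "2016-05-01", "2014-03-01", "2016-06-06",
    ]),
    "hospital_id": ["01J", "02J", "01J", "02H", "02H", "02J"],
})

# Chronological order within each client, so "first" means earliest visit.
df = df.sort_values(["client_id", "date_of_encounter"])

# 1 the first time a client appears at a hospital, 0 on every repeat visit.
first_visit = (~df.duplicated(subset=["client_id", "hospital_id"])).astype(int)

# Running total of first visits per client = distinct hospitals seen so far.
df["hospital_count"] = first_visit.groupby(df["client_id"]).cumsum()

print(df)
```

Because no rows are dropped, this version never produces `NaN` and needs no fill step.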
Cumulative count of the number of visits of each client to each hospital
import pandas as pd

df = pd.DataFrame({
    'record_id': list(range(1, 7)),
    'client_id': ['MK', 'JJ', 'MK', 'JJ', 'MK', 'JJ'],
    'date': [20140101, 20160401, 20140226, 20160501, 20140301, 20160606],
    'hospital': ['1j', '2j', '1j', '2h', '2h', '2j']
})

# Number each client's visits to each hospital in date order.
# The result keeps the original index, so it aligns back onto the unsorted df.
df['hospital_count'] = df.sort_values(by=['client_id', 'hospital', 'date']
    ).groupby(['client_id', 'hospital']
    ).cumcount() + 1
print(df)
#    record_id client_id      date hospital  hospital_count
# 0          1        MK  20140101       1j               1
# 1          2        JJ  20160401       2j               1
# 2          3        MK  20140226       1j               2
# 3          4        JJ  20160501       2h               1
# 4          5        MK  20140301       2h               1
# 5          6        JJ  20160606       2j               2
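To make clear what the per-hospital `cumcount` is computing, here is a rough plain-Python equivalent on the same sample rows. The helper name `visit_numbers` is made up for this illustration. It walks the visits in date order and keeps one running counter per (client, hospital) pair.

```python
from collections import defaultdict

# (client_id, date, hospital) for records 1..6, in their original order.
visits = [
    ("MK", 20140101, "1j"),
    ("JJ", 20160401, "2j"),
    ("MK", 20140226, "1j"),
    ("JJ", 20160501, "2h"),
    ("MK", 20140301, "2h"),
    ("JJ", 20160606, "2j"),
]


def visit_numbers(rows):
    """For each row, return how many times that client has visited
    that hospital up to and including this visit."""
    seen = defaultdict(int)      # running counter per (client, hospital)
    result = [0] * len(rows)
    # Process visits chronologically, but store each answer at the
    # row's original position (the analogue of index alignment).
    for pos in sorted(range(len(rows)), key=lambda i: rows[i][1]):
        client, _, hospital = rows[pos]
        seen[(client, hospital)] += 1
        result[pos] = seen[(client, hospital)]
    return result


counts = visit_numbers(visits)
print(counts)  # [1, 1, 2, 1, 1, 2]
```

The printed list matches the `hospital_count` column produced by the pandas version, row for row.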