How to generate a group-level cumulative unique count in Python?

Question

I have some hospital visit healthcare data in a dataframe of the form:

record_id	client_id	date_of_encounter	hospital_id
1	MK456	2014-01-01	01J
2	JJ103	2016-04-01	02J
3	MK456	2014-02-26	01J
4	JJ103	2016-05-01	02H
5	MK456	2014-03-01	02H
6	JJ103	2016-06-06	02J

I want to create a column hospital_count, which is a cumulative count of the UNIQUE hospitals visited by each client as of the date_of_encounter. I have already sorted the data by client_id and date_of_encounter. The resulting transformation would be:

record_id	client_id	date_of_encounter	hospital_id	hospital_count
1	MK456	2014-01-01	01J	1
3	MK456	2014-02-26	01J	1
5	MK456	2014-03-01	02H	2
2	JJ103	2016-04-01	02J	1
4	JJ103	2016-05-01	02H	2
6	JJ103	2016-06-06	02J	2

Some suggest using a combination of groupby and cumsum(), but I am not sure how to do that.

Answer 1

Using GroupBy.cumcount

Cumulative count of the number of distinct hospitals visited by each client

import pandas as pd

df = pd.DataFrame({
  'record_id': list(range(1,7)),
  'client_id':['MK', 'JJ', 'MK', 'JJ', 'MK', 'JJ'],
  'date': [20140101, 20160401,20140226,20160501,20140301,20160606],
  'hospital': ['1j', '2j', '1j', '2h', '2h', '2j']
})

df.sort_values(by=['client_id', 'date'], inplace=True)

df['hospital_count'] = df.drop_duplicates(subset=['client_id', 'hospital']
  ).groupby('client_id').cumcount() + 1

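# Forward-fill the rows skipped by drop_duplicates
# (fillna(method='ffill') is deprecated in recent pandas versions).
df['hospital_count'] = df['hospital_count'].ffill()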

print(df)
#    record_id client_id      date hospital  hospital_count
# 1          2        JJ  20160401       2j             1.0
# 3          4        JJ  20160501       2h             2.0
# 5          6        JJ  20160606       2j             2.0
# 0          1        MK  20140101       1j             1.0
# 2          3        MK  20140226       1j             1.0
# 4          5        MK  20140301       2h             2.0

Explanation: We drop repeat visits of the same client to the same hospital using drop_duplicates; then we can simply number the remaining first-time visits of each client using groupby and cumcount. However, this leaves NaN values in the rows that were dropped; we fill those by forward-filling from the previous row, which works because the frame is sorted by client and date and the first row of every client is never a duplicate.
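Since the question mentions groupby and cumsum(), here is a minimal sketch of that variant, not part of the original answer. It assumes the column names from the question and marks first-time visits with DataFrame.duplicated, then takes a running sum per client. The intermediate name first_visit is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5, 6],
    "client_id": ["MK456", "JJ103", "MK456", "JJ103", "MK456", "JJ103"],
    "date_of_encounter": pd.to_datetime([
        "2014-01-01", "2016-04-01", "2014-02-26",
        "2016-05-01", "2014-03-01", "2016-06-06",
    ]),
    "hospital_id": ["01J", "02J", "01J", "02H", "02H", "02J"],
})

# Chronological order within each client is required for a cumulative count.
df = df.sort_values(["client_id", "date_of_encounter"])

# True on a client's first visit to a given hospital, False on repeat visits.
first_visit = ~df.duplicated(subset=["client_id", "hospital_id"])

# Running total of first visits per client = distinct hospitals seen so far.
df["hospital_count"] = first_visit.astype(int).groupby(df["client_id"]).cumsum()

print(df)
```

This version produces no NaN values, so no fill step is needed and the column stays an integer type.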

Cumulative count of the number of visits of each client to each hospital

import pandas as pd

df = pd.DataFrame({
  'record_id': list(range(1,7)),
  'client_id':['MK', 'JJ', 'MK', 'JJ', 'MK', 'JJ'],
  'date': [20140101, 20160401,20140226,20160501,20140301,20160606],
  'hospital': ['1j', '2j', '1j', '2h', '2h', '2j']
})

df['hospital_count'] = df.sort_values(by=['client_id', 'hospital', 'date']
  ).groupby(['client_id', 'hospital']
  ).cumcount() + 1

print(df)
#    record_id client_id      date hospital  hospital_count
# 0          1        MK  20140101       1j               1
# 1          2        JJ  20160401       2j               1
# 2          3        MK  20140226       1j               2
# 3          4        JJ  20160501       2h               1
# 4          5        MK  20140301       2h               1
# 5          6        JJ  20160606       2j               2
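The printed frame above is still in its original row order even though the count was computed on a sorted copy. The sketch below illustrates why, assuming the same toy data: cumcount returns a Series that carries the sorted frame's index labels, and column assignment aligns on those labels. The name counts is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "client_id": ["MK", "JJ", "MK", "JJ", "MK", "JJ"],
    "date": [20140101, 20160401, 20140226, 20160501, 20140301, 20160606],
    "hospital": ["1j", "2j", "1j", "2h", "2h", "2j"],
})

# The result is indexed like the sorted frame, not like df.
counts = (
    df.sort_values(by=["client_id", "hospital", "date"])
      .groupby(["client_id", "hospital"])
      .cumcount() + 1
)
print(counts.index.tolist())

# Assignment matches rows by index label, so df keeps its original order.
df["hospital_count"] = counts
print(df["hospital_count"].tolist())
```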

Answer 1
0 2021-08-12 14:39:31
