
Combine column counts from different dataframes pandas
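I have two dataframes. One holds demographic information about patients, and the other holds feature information. Below is some dummy data that represents my dataset:
Demographics: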
import pandas as pd

demographics = {
    'PatientID': [10, 11, 12, 13],
    'DOB': ['1971-10-23', '1969-06-18', '1973-04-20', '1971-05-31'],
    'Sex': ['M', 'M', 'F', 'M'],
    'Flag': [0, 1, 0, 0]
}
demographics = pd.DataFrame(demographics)
# parse the date-of-birth strings into datetime64 values
demographics['DOB'] = pd.to_datetime(demographics['DOB'])
Here is the printed dataframe:
print(demographics)
   PatientID        DOB Sex  Flag
0         10 1971-10-23   M     0
1         11 1969-06-18   M     1
2         12 1973-04-20   F     0
3         13 1971-05-31   M     0
Features:
features = {
    'PatientID': [10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
    'Feature': ['A', 'B', 'A', 'A', 'C', 'B', 'C', 'A', 'B', 'B', 'A', 'C', 'D', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'B', 'C', 'C', 'C', 'B', 'B', 'C'],
}
features = pd.DataFrame(features)
Here are the counts of each feature for each patient:
print(features.groupby(['PatientID', 'Feature']).size())
PatientID  Feature
10         A          3
           B          2
           C          2
11         A          3
           B          3
           C          3
           D          1
12         B          3
           C          4
           D          3
dtype: int64
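I want to integrate the per-patient feature counts into the demographics table. Note that patient 13 does not appear in the features table. The final dataframe should look like this: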
result = {
    'PatientID': [10, 11, 12, 13],
    'DOB': ['1971-10-23', '1969-06-18', '1973-04-20', '1971-05-31'],
    'Feature_A': [3, 3, 0, 0],
    'Feature_B': [2, 3, 3, 0],
    'Feature_C': [2, 3, 4, 0],
    'Feature_D': [0, 1, 3, 0],
    'Sex': ['M', 'M', 'F', 'M'],
    'Flag': [0, 1, 0, 0],
}
result = pd.DataFrame(result)
result['DOB'] = pd.to_datetime(result['DOB'])
print(result)
   PatientID        DOB  Feature_A  Feature_B  Feature_C  Feature_D Sex  Flag
0         10 1971-10-23          3          2          2          0   M     0
1         11 1969-06-18          3          3          3          1   M     1
2         12 1973-04-20          0          3          4          3   F     0
3         13 1971-05-31          0          0          0          0   M     0
How can I get this result from these two dataframes?
Cross-tabulate features and merge the result with demographics.
# cross-tabulate feature df
# and reindex it by PatientID to carry PatientIDs without features
feature_counts = (
    pd.crosstab(features['PatientID'], features['Feature'])
    .add_prefix('Feature_')
    .reindex(demographics['PatientID'], fill_value=0)
)
# merge the two
demographics.merge(feature_counts, on='PatientID')
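The reindex step above is what keeps patient 13 in the counts. As a rough sketch of an alternative (not what the answer itself does), the same effect can be had with a left merge followed by filling the missing counts with zero. The names counts, feature_cols, and merged are illustrative only, and the data is the question's dummy data written compactly:

```python
import pandas as pd

# Same rows as the question's dummy data, written compactly.
demographics = pd.DataFrame({
    'PatientID': [10, 11, 12, 13],
    'DOB': pd.to_datetime(['1971-10-23', '1969-06-18', '1973-04-20', '1971-05-31']),
    'Sex': ['M', 'M', 'F', 'M'],
    'Flag': [0, 1, 0, 0],
})
features = pd.DataFrame({
    'PatientID': [10] * 7 + [11] * 10 + [12] * 10,
    'Feature': list('ABAACBC') + list('ABBACDABCC') + list('DDDBCCCBBC'),
})

# One row per patient that has features, one column per feature value.
counts = (
    pd.crosstab(features['PatientID'], features['Feature'])
    .add_prefix('Feature_')
)
feature_cols = list(counts.columns)

# how='left' keeps every demographics row; patients with no features get NaN.
merged = demographics.merge(counts.reset_index(), on='PatientID', how='left')

# Replace the NaN counts with 0 and restore integer counts.
merged[feature_cols] = merged[feature_cols].fillna(0).astype(int)
print(merged)
```

Compared with reindexing, this version does not need the counts table to know the full patient list up front, at the cost of an extra fillna step after the merge.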
Fix your code by adding unstack:
out = (
    features.groupby(['PatientID', 'Feature']).size()
    .unstack(fill_value=0)
    .add_prefix('Feature_')
    .reindex(demographics['PatientID'], fill_value=0)
    .reset_index()
    .merge(demographics)
)
Out[30]:
   PatientID  Feature_A  Feature_B  Feature_C  Feature_D        DOB Sex  Flag
0         10          3          2          2          0 1971-10-23   M     0
1         11          3          3          3          1 1969-06-18   M     1
2         12          0          3          4          3 1973-04-20   F     0
3         13          0          0          0          0 1971-05-31   M     0
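The output above has the right values, but its columns are in a different order from the layout the question asks for (PatientID, DOB, the feature counts, then Sex and Flag). A minimal sketch of one way to reorder them, assuming the feature columns can be recognised by the Feature_ prefix; the names feature_cols and ordered are illustrative:

```python
import pandas as pd

# Same rows as the question's dummy data, written compactly.
demographics = pd.DataFrame({
    'PatientID': [10, 11, 12, 13],
    'DOB': pd.to_datetime(['1971-10-23', '1969-06-18', '1973-04-20', '1971-05-31']),
    'Sex': ['M', 'M', 'F', 'M'],
    'Flag': [0, 1, 0, 0],
})
features = pd.DataFrame({
    'PatientID': [10] * 7 + [11] * 10 + [12] * 10,
    'Feature': list('ABAACBC') + list('ABBACDABCC') + list('DDDBCCCBBC'),
})

# The groupby/unstack pipeline from the answer.
out = (
    features.groupby(['PatientID', 'Feature']).size()
    .unstack(fill_value=0)
    .add_prefix('Feature_')
    .reindex(demographics['PatientID'], fill_value=0)
    .reset_index()
    .merge(demographics)
)

# Collect the count columns, then select columns in the desired order.
feature_cols = [c for c in out.columns if c.startswith('Feature_')]
ordered = out[['PatientID', 'DOB', *feature_cols, 'Sex', 'Flag']]
print(ordered)
```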