How to order strings with numbers and letters in order to categorize them in Python?
I am working with a dataset in which some diagnoses are coded in ICD-10. Because there are many different codes, I want to group them into broader categories, and I found those categories online. The problem is that the codes look like 'A04' or 'Z01', and I can't order them because they mix letters and numbers. I tried the code below, but I know that the variable 'diag_icd10_ranges' is not right. Can anyone help me, please?
df['code_diag_assoc_icd10'] = df['Assoc_Diagnose']
# Associated category names
diag_icd10_ranges = [(A00, B99), (C00, D49), (D50, D89), (E00, E89), (F01, F99), (G00, G99),
(H00, H59), (H60, H95), (I00, I99), (J00, J99), (K00, K95), (L00, L99),
(M00, M99), (N00, N99), (O00, O9A), (P00, P96), (Q00, Q99), (R00, R99),
(S00, T88), (V00, Y99), (Z00, Z99)]
diag_icd10_dict = {0: 'infectious_icd10d', 1: 'neoplasms_icd10d', 2: 'blood_icd10d', 3: 'endocrine_icd10d',
4: 'mental_icd10d', 5: 'nervous_icd10d', 6: 'eye_icd10d', 7: 'ear_icd10d',
8: 'circulatory_icd10d', 9: 'respiratory_icd10d', 10: 'digestive_icd10d', 11: 'skin_icd10d',
12: 'musculo_icd10d', 13: 'genitourinary_icd10d', 14: 'pregnancy_icd10d', 15: 'perinatalperiod_icd10d',
16: 'congenital_icd10d',
17: 'abnormalfindings_icd10d', 18:'injury_icd10d', 19:'morbidity', 20:'healthstatus'}
# Re-code in terms of integer
for num, cat_range in enumerate(diag_icd10_ranges):
    df['code_diag_assoc_icd10'] = np.where(df['code_diag_assoc_icd10'].between(cat_range[0],cat_range[1]),
                                           num, df['code_diag_assoc_icd10'])
# Convert integer to category name using diag_dict
df['cat_diag_assoc_icd10'] = df['code_diag_assoc_icd10'].replace(proc_icd10_dict)
You should be able to do the "between" check the Pythonic way, with a chained comparison on the strings themselves. See the code below.
In [21]: diag_icd10_ranges = [{ 1 : ('A00', 'B99') },
...: { 2 : ('C00', 'D49') },
...: { 3 : ('D50', 'D89') },
...: { 4 : ('E00', 'E89') },
...: { 5 : ('F01', 'F99') },
...: { 6 : ('G00', 'G99') },
...: { 7 : ('H00', 'H59') },
...: { 8 : ('H60', 'H95') },
...: { 9 : ('I00', 'I99') },
...: { 10: ('J00', 'J99') },
...: { 11: ('K00', 'K95') },
...: { 12: ('L00', 'L99') },
...: { 13: ('M00', 'M99') },
...: { 14: ('N00', 'N99') },
...: { 15: ('O00', 'O9A') },
...: { 16: ('P00', 'P96') },
...: { 17: ('Q00', 'Q99') },
...: { 18: ('R00', 'R99') },
...: { 19: ('S00', 'T88') },
...: { 20: ('V00', 'Y99') },
...: { 21: ('Z00', 'Z99') }
...: ]
...:
...: heart_failure_icd10_code = 'I50.9'
...:
...: chapter_number = [key for rec in diag_icd10_ranges for key, value in rec.items() if value[0] <= heart_failure_icd10_code <= value[1] ]
In [22]: print(chapter_number)
[9]
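One caveat with comparing the raw strings: a subcode at the top of a range, such as 'B99.9', sorts after 'B99' and so matches no range. A minimal sketch of one way around this, assuming every code starts with its three-character category (the `CHAPTERS` list and the `chapter_of` helper are names made up for this example, not part of the answer above):

```python
# Chapter ranges copied from the answer above, as (number, low, high) tuples.
CHAPTERS = [
    (1, "A00", "B99"), (2, "C00", "D49"), (3, "D50", "D89"),
    (4, "E00", "E89"), (5, "F01", "F99"), (6, "G00", "G99"),
    (7, "H00", "H59"), (8, "H60", "H95"), (9, "I00", "I99"),
    (10, "J00", "J99"), (11, "K00", "K95"), (12, "L00", "L99"),
    (13, "M00", "M99"), (14, "N00", "N99"), (15, "O00", "O9A"),
    (16, "P00", "P96"), (17, "Q00", "Q99"), (18, "R00", "R99"),
    (19, "S00", "T88"), (20, "V00", "Y99"), (21, "Z00", "Z99"),
]


def chapter_of(code):
    """Return the chapter number for an ICD-10 code, or None if no range matches."""
    # Assumption: the first three characters are the category, e.g. 'B99.9' -> 'B99'.
    category = code.strip().upper()[:3]
    for number, low, high in CHAPTERS:
        # Plain string comparison works because every bound has the same
        # letter-then-two-characters shape.
        if low <= category <= high:
            return number
    return None


print(chapter_of("I50.9"))  # 9
print(chapter_of("B99.9"))  # 1 (the raw string 'B99.9' would fall outside 'A00'..'B99')
print(chapter_of("F00"))    # None, because the ranges above start chapter 5 at 'F01'
```

Returning None for codes that fall in a gap between ranges makes unmatched values easy to spot. With pandas, a function like this can be passed to `Series.map` to build the category column.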
You can use bisect_right with your ranges expressed using only their lower bounds (bisect_left would put a code that is exactly equal to a bound, such as "C00", in the group below it):
from bisect import bisect_right
# Lower bounds of every group after the first; the list must be sorted and free of duplicates
ranges = ["C00","D50","E00","F00","G00","H00","H60","I00",
          "J00","K00","L00","M00","N00","O00","P00","Q00",
          "R00","S00","V00","Z00"]
def icdGroup(code): return bisect_right(ranges,code)
icdGroup("B20") # 0
icdGroup("H65") # 7
All codes from blank to < C00 will be at index 0, from C00 to < D50 will be at index 1, ... and so on. Codes >= Z00 will be at index 20.
bisect_right does a binary search, so each lookup takes O(log n) comparisons (at most 5 for these 20 bounds). If you have a lot of codes to categorize, this will be more efficient than searching the ranges sequentially.
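To go from the index to a category name, the lower bounds can be paired with the labels from the question's diag_icd10_dict. This is only a sketch: `LOWER_BOUNDS`, `NAMES` and `icd_category` are names invented for the example, codes are assumed to start with their three-character category, and `bisect_right` is used so that a code equal to a lower bound (for example 'C00') lands in the group that bound opens.

```python
from bisect import bisect_right

# Lower bounds of every group after the first (sorted, no duplicates).
LOWER_BOUNDS = ["C00", "D50", "E00", "F00", "G00", "H00", "H60", "I00",
                "J00", "K00", "L00", "M00", "N00", "O00", "P00", "Q00",
                "R00", "S00", "V00", "Z00"]

# Labels taken from diag_icd10_dict in the question, in the same order (keys 0..20).
NAMES = ["infectious_icd10d", "neoplasms_icd10d", "blood_icd10d",
         "endocrine_icd10d", "mental_icd10d", "nervous_icd10d",
         "eye_icd10d", "ear_icd10d", "circulatory_icd10d",
         "respiratory_icd10d", "digestive_icd10d", "skin_icd10d",
         "musculo_icd10d", "genitourinary_icd10d", "pregnancy_icd10d",
         "perinatalperiod_icd10d", "congenital_icd10d",
         "abnormalfindings_icd10d", "injury_icd10d", "morbidity",
         "healthstatus"]


def icd_category(code):
    """Map an ICD-10 code to a category label via binary search."""
    # Assumption: the first three characters identify the category.
    category = code.strip().upper()[:3]
    # bisect_right counts how many lower bounds are <= category,
    # which is exactly the index of the group the code belongs to.
    return NAMES[bisect_right(LOWER_BOUNDS, category)]


print(icd_category("B20"))    # infectious_icd10d
print(icd_category("C00"))    # neoplasms_icd10d
print(icd_category("H65"))    # ear_icd10d
print(icd_category("I50.9"))  # circulatory_icd10d
```

Note that this does not validate its input: any string, including an empty one, is assigned to some group, so malformed codes should be filtered out beforehand.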