How to order strings with numbers and letters in order to categorize in Python?

I am working with a dataset in which the diagnoses are coded in ICD-10. Because there are many different codes, I want to group them into broader categories, and I found the category ranges online. The problem is that the codes look like 'A04' or 'Z01', and I can't work out how to order them because they mix letters and numbers. I tried the code below, but I know that the variable 'diag_icd10_ranges' is not right. Can anyone help me, please?

df['code_diag_assoc_icd10'] = df['Assoc_Diagnose']

# Associated category names
diag_icd10_ranges = [(A00, B99), (C00, D49), (D50, D89), (E00, E89), (F01, F99), (G00, G99), 
       (H00, H59), (H60, H95), (I00, I99), (J00, J99), (K00, K95), (L00, L99),
       (M00, M99), (N00, N99), (O00, O9A), (P00, P96), (Q00, Q99), (R00, R99),
       (S00, T88), (V00, Y99), (Z00, Z99)]

diag_icd10_dict = {0: 'infectious_icd10d', 1: 'neoplasms_icd10d', 2: 'blood_icd10d', 3: 'endocrine_icd10d',
           4: 'mental_icd10d', 5: 'nervous_icd10d', 6: 'eye_icd10d', 7: 'ear_icd10d',
           8: 'circulatory_icd10d', 9: 'respiratory_icd10d', 10: 'digestive_icd10d', 11: 'skin_icd10d', 
          12: 'musculo_icd10d', 13: 'genitourinary_icd10d', 14: 'pregnancy_icd10d', 15: 'perinatalperiod_icd10d', 
          16: 'congenital_icd10d',
          17: 'abnormalfindings_icd10d', 18:'injury_icd10d', 19:'morbidity', 20:'healthstatus'}

# Re-code in terms of integer
for num, cat_range in enumerate(diag_icd10_ranges):
    df['code_diag_assoc_icd10'] = np.where(df['code_diag_assoc_icd10'].between(cat_range[0], cat_range[1]),
                                           num, df['code_diag_assoc_icd10'])

# Convert integer to category name using diag_icd10_dict
df['cat_diag_assoc_icd10'] = df['code_diag_assoc_icd10'].replace(diag_icd10_dict)

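You should be able to do the range check the Pythonic way, with a chained comparison. Python compares strings character by character, so 'A04' < 'B99' works as expected once the codes are quoted. See the code below.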

In [21]: diag_icd10_ranges = [{ 1 : ('A00', 'B99') }, 
    ...:                      { 2 : ('C00', 'D49') }, 
    ...:                      { 3 : ('D50', 'D89') }, 
    ...:                      { 4 : ('E00', 'E89') }, 
    ...:                      { 5 : ('F01', 'F99') }, 
    ...:                      { 6 : ('G00', 'G99') }, 
    ...:                      { 7 : ('H00', 'H59') }, 
    ...:                      { 8 : ('H60', 'H95') }, 
    ...:                      { 9 : ('I00', 'I99') }, 
    ...:                      { 10: ('J00', 'J99') }, 
    ...:                      { 11: ('K00', 'K95') }, 
    ...:                      { 12: ('L00', 'L99') }, 
    ...:                      { 13: ('M00', 'M99') }, 
    ...:                      { 14: ('N00', 'N99') }, 
    ...:                      { 15: ('O00', 'O9A') }, 
    ...:                      { 16: ('P00', 'P96') }, 
    ...:                      { 17: ('Q00', 'Q99') }, 
    ...:                      { 18: ('R00', 'R99') }, 
    ...:                      { 19: ('S00', 'T88') }, 
    ...:                      { 20: ('V00', 'Y99') }, 
    ...:                      { 21: ('Z00', 'Z99') }
    ...:                     ]
    ...: 
    ...: heart_failure_icd10_code = 'I50.9'
    ...: 
    ...: chapter_number = [key for rec in diag_icd10_ranges for key, value in rec.items() if value[0] <= heart_failure_icd10_code <= value[1] ]

In [22]: print(chapter_number)
[9]
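One caveat with comparing full codes: a subcode such as 'B99.9' sorts after 'B99', so it falls outside every range. A minimal sketch of one way around that, assuming the chapter is determined by the first three characters of the code. The ranges and category names are copied from the question; the helper name icd10_chapter and the table name ICD10_CHAPTERS are made up for this illustration.

```python
# (low, high, name) triples; ranges and names are taken from the question.
ICD10_CHAPTERS = [
    ("A00", "B99", "infectious_icd10d"),
    ("C00", "D49", "neoplasms_icd10d"),
    ("D50", "D89", "blood_icd10d"),
    ("E00", "E89", "endocrine_icd10d"),
    ("F01", "F99", "mental_icd10d"),
    ("G00", "G99", "nervous_icd10d"),
    ("H00", "H59", "eye_icd10d"),
    ("H60", "H95", "ear_icd10d"),
    ("I00", "I99", "circulatory_icd10d"),
    ("J00", "J99", "respiratory_icd10d"),
    ("K00", "K95", "digestive_icd10d"),
    ("L00", "L99", "skin_icd10d"),
    ("M00", "M99", "musculo_icd10d"),
    ("N00", "N99", "genitourinary_icd10d"),
    ("O00", "O9A", "pregnancy_icd10d"),
    ("P00", "P96", "perinatalperiod_icd10d"),
    ("Q00", "Q99", "congenital_icd10d"),
    ("R00", "R99", "abnormalfindings_icd10d"),
    ("S00", "T88", "injury_icd10d"),
    ("V00", "Y99", "morbidity"),
    ("Z00", "Z99", "healthstatus"),
]


def icd10_chapter(code):
    """Return the category name for an ICD-10 code, or None if no range matches."""
    # Assumption: only the three-character prefix decides the chapter,
    # so 'I50.9' is looked up as 'I50'.
    prefix = code.strip().upper()[:3]
    for low, high, name in ICD10_CHAPTERS:
        if low <= prefix <= high:
            return name
    return None


print(icd10_chapter("I50.9"))  # circulatory_icd10d
print(icd10_chapter("B99.9"))  # infectious_icd10d
```

Codes that fall in a gap between ranges return None, which makes unexpected values easy to spot.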


You can use bisect_right with your ranges expressed using only their lower bounds:

from bisect import bisect_right

# Lower bound of every group except the first. The list must be sorted
# and free of duplicates for the binary search to work.
ranges = ["C00","D50","E00","F00","G00","H00","H60","I00",
          "J00","K00","L00","M00","N00","O00","P00","Q00",
          "R00","S00","V00","Z00"]

# bisect_right puts a code that equals a bound (e.g. "C00") in the group
# that starts at that bound; bisect_left would put it in the group before.
def icdGroup(code): return bisect_right(ranges,code)

icdGroup("B20") # 0
icdGroup("H65") # 7

All codes from blank to < C00 will be at index 0, from C00 to < D50 will be at index 1, ... and so on. Codes >= Z00 will be at index 20.

The lookup is a binary search, so it needs O(log n) comparisons for n bounds (about 5 for these 20). If you have a lot of codes to categorize, this will be more efficient than searching the ranges sequentially.
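To go from the group index to a category name, the lower-bound lookup can be paired with a list of names in the same order. This is a sketch under stated assumptions: the names come from the question's diag_icd10_dict, while LOWER_BOUNDS, CATEGORY_NAMES and icd_category are hypothetical names chosen for the example. It uses bisect_right so that a code equal to a lower bound, such as "C00", lands in the group that starts there.

```python
from bisect import bisect_right

# Sorted lower bounds of groups 1..20; group 0 is everything below "C00".
LOWER_BOUNDS = ["C00", "D50", "E00", "F00", "G00", "H00", "H60", "I00",
                "J00", "K00", "L00", "M00", "N00", "O00", "P00", "Q00",
                "R00", "S00", "V00", "Z00"]

# One name per group, in the same order as the question's diag_icd10_dict.
CATEGORY_NAMES = [
    "infectious_icd10d", "neoplasms_icd10d", "blood_icd10d",
    "endocrine_icd10d", "mental_icd10d", "nervous_icd10d", "eye_icd10d",
    "ear_icd10d", "circulatory_icd10d", "respiratory_icd10d",
    "digestive_icd10d", "skin_icd10d", "musculo_icd10d",
    "genitourinary_icd10d", "pregnancy_icd10d", "perinatalperiod_icd10d",
    "congenital_icd10d", "abnormalfindings_icd10d", "injury_icd10d",
    "morbidity", "healthstatus",
]


def icd_category(code):
    """Map an ICD-10 code to a category name via binary search on lower bounds."""
    index = bisect_right(LOWER_BOUNDS, code.strip().upper())
    return CATEGORY_NAMES[index]


codes = ["A04", "C00", "H65", "I50.9", "Z01"]
print([icd_category(c) for c in codes])
```

Because only lower bounds are stored, a code that lies in a gap between the published ranges (a "U" code, for instance) is silently assigned to the preceding group rather than rejected, so validate the codes first if that matters.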
