
Parse a list containing HTML-like elements into nested JSON using Python

I'm not very experienced at converting sections of a list into nested JSON and was hoping for some guidance. I have a list containing data like the one below:

data = ["<h5>1",
    "<h6>Type of Care|",
    "<h6>SA|",
    "<h6>Type of Care|",
    "<h6>Substance use treatment|",
    "<h6>DT Detoxification |",
    "<h6>HH Transitional housing, halfway house, or sober home|",
    "<h6>SUMH |",
    "<h6>Treatment for co-occurring serious mental health  illness/serious emotional disturbance and substance  use disorders|",
    "",
    "<h5>2",
    "<h6>Telemedicine|",
    "<h6>TELE|",
    "<h6>Telemedicine|",
    "<h6>Telemedicine/telehealth|",
    "",
    "<h5>3 |",
    "",
    "<h6>Service Settings (e.g., Outpatient, |",
    "<h6>Residential, Inpatient, etc.)|",
    "<h6>HI|",
    "<h6>Service Settings (e.g., Outpatient, |",
    "<h6>Residential, Inpatient, etc.)|",
    "<h6>Hospital inpatient |",
    "<h6>OP Outpatient |",
    "<h6>RES Residential|",
    "<h6>HID Hospital inpatient detoxification|",
    "<h6>HIT Hospital inpatient treatment|",
    "<h6>OD Outpatient detoxification|",
    "<h6>ODT Outpatient day treatment or partial hospitalization|",
    "<h6>OIT Intensive outpatient treatment|",
    "<h6>OMB |",
    "<h6>Outpatient methadone/buprenorphine or  naltrexone treatment|",
    "<h6>ORT Regular outpatient treatment|",
    "<h6>RD Residential detoxification|",
    "<h6>RL Long-term residential|",
    "<h6>RS Short-term residential|"]

I want to first remove all records in the list that have no content. Then I want to turn each record that contains an "<h5>" tag into a key and group the "<h6>" records that follow it into values, as in this JSON output:

"codekey": [
                {
                    "category": [
                        {
                            "key": 1,
                            "value": "Type of Care"
                        }
                    ],
                    "codes": [
                        {
                            "key": "SA",
                            "value": "Substance use treatment"
                        },
                        {
                            "key": "DT",
                            "value": "Detoxification"
                        },
                        {
                            "key": "HH",
                            "value": "Transitional housing, halfway house, or sober home"
                        },
                        {
                            "key": "SUMH",
                            "value": "Treatment for co-occurring serious mental health | illness/serious emotional disturbance and substance | use disorders|"
                        }
                    ]
                },
                {
                    "category": [
                        {
                            "key": 2,
                            "value": "Telemedicine"
                        }
                    ],
                    "codes": [
                        {
                            "key": "TELE",
                            "value": "TelemedicineTelemedicine/telehealth"
                        }
                    ]
                }
            ], etc....

I think I need a loop, but I'm stuck on how to create the key/value relationship. I think I also need a regex, but I don't know Python well enough to work out how to convert the data into the required output. Any advice on tutorials I could look up, or any preliminary suggestions on how to get started? Thank you!
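
As a starting point for the loop and the regex mentioned above, here is a minimal sketch of the first steps only: dropping empty records, starting a new group at every "<h5>" record, and stripping the tags and "|" separators. The names clean, group_by_header and split_code are hypothetical, and the rule that a code is a run of two or more capital letters at the start of a record is an assumption about the data, not something the list guarantees.

```python
import re

TAG_RE = re.compile(r"^<h[56]>", re.IGNORECASE)  # leading <h5>/<h6> tag
# Assumption: a code is a run of 2+ capital letters at the start of the text.
CODE_RE = re.compile(r"^([A-Z]{2,})\b\s*(.*)$")


def clean(record):
    """Strip the leading tag, the '|' separators, and surplus whitespace."""
    text = TAG_RE.sub("", record).replace("|", " ")
    return " ".join(text.split())


def group_by_header(records, header_tag="<h5>"):
    """Drop empty records, then start a new group at every header record."""
    groups = []
    for record in records:
        stripped = record.strip()
        if not stripped:
            continue  # skip records with no content
        if stripped.lower().startswith(header_tag):
            groups.append({"key": clean(stripped), "items": []})
        elif groups:
            groups[-1]["items"].append(clean(stripped))  # attach to latest header
    return groups


def split_code(text):
    """Return (code, description) if text starts with a code, else (None, text)."""
    match = CODE_RE.match(text)
    if match:
        return match.group(1), match.group(2)
    return None, text


sample = ["<h5>1", "<h6>Type of Care|", "<h6>DT Detoxification |", "",
          "<h5>3 |", "", "<h6>HI|"]
groups = group_by_header(sample)
print(groups)
print(split_code("DT Detoxification"))  # ('DT', 'Detoxification')
print(split_code("Type of Care"))       # (None, 'Type of Care')
```

Pairing a code that sits on its own line (such as "SA") with the description that follows a few records later still needs a rule of your own, because the list does not mark which records belong together.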

Assuming your format remains constant, here's a flexible, configurable solution:

class Separator:
    def __init__(self, data, title, sep, splitter):
        self.data = data  # the full list of records
        self.title = title  # tag that starts a new group, "<h5>" in your case
        self.sep = sep  # record that marks the end of a group
        self.splitter = splitter  # separator between key and value, "|" here
        self.res = []  # final result
        self.tempDict = {}  # group currently being built

    def clearString(self, string, *args):
        for arg in args:
            string = string.replace(arg, '')  # remove every occurrence of arg
        return string.strip()

    def updateDict(self, val):
        if val == self.sep:
            self.res.append(self.tempDict)  # the group is complete, store it
            self.tempDict = {}  # start a fresh group
        else:
            try:
                if self.title in val:  # an "<h5>" record in your case
                    # the category name is the record right after the "<h5>" record
                    nextVal = self.data[self.data.index(val) + 1]
                    self.tempDict["category"] = [{
                        "key": self.clearString(val, self.title, self.splitter),
                        "value": self.clearString(nextVal, '<h6>', '|'),
                    }]
                elif self.tempDict["category"][0]["value"] != self.clearString(val, '<h6>', '|'):
                    # any "<h6>" record other than the category name: expects "key|value"
                    val = self.clearString(val, "<h6>").split("|")
                    if "codes" not in self.tempDict:
                        self.tempDict["codes"] = []  # create the list on first use
                    self.tempDict["codes"].append({"key": val[0], "value": val[1]})
            except (KeyError, IndexError):  # skip records that do not fit the format
                pass
        return self.res


separator = Separator(data, '<h5>', '', '|')  # avoid shadowing the built-in `object`
for val in data:
    res = separator.updateDict(val)
print(res)

Output for the sample input you provided:

[
    {
        'category': [{'key': '1', 'value': 'Type of Care'}],
        'codes': [
            {'key': 'SA', 'value': 'Substance use treatment'},
            {'key': 'DT', 'value': 'Detoxification '},
            {
                'key': 'HH',
                'value': 'Transitional housing, halfway house, or sober home',
            },
            {
                'key': 'SUMH',
                'value': 'Treatment for co-occurring serious mental health ',
            },
        ],
    },
    {
        'category': [{'key': '2', 'value': 'Telemedicine'}],
        'codes': [
            {'key': 'TELE', 'value': 'TelemedicineTelemedicine/telehealth'},
        ],
    },
]
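
To get from a list like this to the JSON shape in the question, the result can be wrapped under a "codekey" key and serialized with the standard json module. The sketch below uses a small hypothetical stand-in for res, and it assumes that numeric category keys should become integers, as the desired output suggests.

```python
import json

# Hypothetical stand-in for the `res` list shown above.
res = [
    {
        "category": [{"key": "1", "value": "Type of Care"}],
        "codes": [{"key": "SA", "value": "Substance use treatment"}],
    }
]

# Assumption: numeric category keys should be ints, as in the desired output.
for group in res:
    for category in group["category"]:
        if category["key"].isdigit():
            category["key"] = int(category["key"])

payload = json.dumps({"codekey": res}, indent=4)
print(payload)
```

json.dumps also switches the single quotes of the printed Python list to the double quotes that JSON requires.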
