
Python XML Parser Issue

I am new to Python, so apologies if this is a basic question. I am trying to read an XML file into a Python object (preferably a pandas DataFrame). For now I am just printing the variables to check that I can read them correctly in tabular form.

I have used xml.etree.ElementTree for this, but I might not be using it as intended.

Code:

import xml.etree.ElementTree as ET
tree = ET.parse("data.xml")
ODM = tree.getroot()

ns = {'xmlns': 'http://www.cdisc.org/ns/odm/v1.3',
      'mdsol': 'http://www.mdsol.com/ns/odm/metadata'}

for ClinicalData in ODM:
    LocationOID=None
    #print(ClinicalData.tag, ClinicalData.attrib)
    for SubjectData in ClinicalData:
        for SiteRef in SubjectData:
            LocationOID=SiteRef.attrib.get('LocationOID')
        for StudyEventData in SubjectData:
            for AuditRecord in StudyEventData:
                print(ClinicalData.attrib.get('MetaDataVersionOID'),
                      ClinicalData.attrib.get('AuditSubCategoryName'),  # null output due to namespace issue
                      SubjectData.attrib.get('SubjectKey'),
                      SubjectData.attrib.get('SubjectName'),            # null output due to namespace issue
                      LocationOID,                                      # not sure what the issue is
                      StudyEventData.attrib.get('StudyEventRepeatKey'),
                      AuditRecord.find('DateTimeStamp')                 # not sure what the issue is
                     )

Input:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3" 
        xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata" 
        CreationDateTime="2019-08-23T12:59:09" FileOID="3b2b4161-fad8-4239-9c83-03d0e62624dd" FileType="Transactional" ODMVersion="1.3">

    <ClinicalData MetaDataVersionOID="1772" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Activated">
        <SubjectData SubjectKey="7735fd9c-1792-457c-aa58-0ca26ecdc810" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-3">
            <SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
            <StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960580">
                <AuditRecord>
                    <UserRef UserOID="systemuser"/>
                    <LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
                    <DateTimeStamp>2019-07-10T07:56:54</DateTimeStamp>
                    <ReasonForChange>Update</ReasonForChange>
                    <SourceID>394263772</SourceID>
                </AuditRecord>
            </StudyEventData>
        </SubjectData>
    </ClinicalData>
</ODM>

I expect every printed variable to hold the corresponding value from the XML file. Please let me know if there is a better way of doing this than nesting several loops.

Namespaces are a pain when using ElementTree. See this discussion.

Short answer:

for ClinicalData in ODM:
    for SubjectData in ClinicalData:
        SiteRef = SubjectData.find('{http://www.cdisc.org/ns/odm/v1.3}SiteRef')
        LocationOID = SiteRef.attrib.get('LocationOID')
        for StudyEventData in SubjectData:
            for AuditRecord in StudyEventData:
                print(
                    ClinicalData.attrib.get('MetaDataVersionOID'),
                    # namespaced attributes need the {uri}name form
                    ClinicalData.attrib.get(
                        '{http://www.mdsol.com/ns/odm/metadata}AuditSubCategoryName'),
                    SubjectData.attrib.get('SubjectKey'),
                    SubjectData.attrib.get(
                        '{http://www.mdsol.com/ns/odm/metadata}SubjectName'),
                    LocationOID,
                    StudyEventData.attrib.get('StudyEventRepeatKey'),
                    # child elements need the namespace too; .text gives the content
                    AuditRecord.find(
                        '{http://www.cdisc.org/ns/odm/v1.3}DateTimeStamp').text,
                )
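If you would rather not repeat the URIs in every path, ElementTree's `find`, `findall` and `findtext` also accept a prefix-to-URI mapping. Here is a minimal sketch of that approach, assuming the sample XML from the question. The `odm` prefix, the `audit_rows` helper and the trimmed inline sample are my own illustrative choices, not part of the original code. Attribute lookups do not use the mapping, so namespaced attributes still need the `{uri}name` form.

```python
import xml.etree.ElementTree as ET

# Prefix names are arbitrary; only the URIs have to match the document.
NS = {
    "odm": "http://www.cdisc.org/ns/odm/v1.3",
    "mdsol": "http://www.mdsol.com/ns/odm/metadata",
}
MDSOL = "{%s}" % NS["mdsol"]  # attributes need the {uri}name form

# Trimmed copy of the question's input, inlined so the sketch is self-contained.
SAMPLE = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
     xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata">
  <ClinicalData MetaDataVersionOID="1772" mdsol:AuditSubCategoryName="Activated">
    <SubjectData SubjectKey="7735fd9c-1792-457c-aa58-0ca26ecdc810" mdsol:SubjectName="ACC-SUBJ-3">
      <SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
      <StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]">
        <AuditRecord>
          <DateTimeStamp>2019-07-10T07:56:54</DateTimeStamp>
        </AuditRecord>
      </StudyEventData>
    </SubjectData>
  </ClinicalData>
</ODM>"""


def audit_rows(root):
    """Return one dict per AuditRecord directly under a StudyEventData."""
    rows = []
    for clinical in root.findall("odm:ClinicalData", NS):
        for subject in clinical.findall("odm:SubjectData", NS):
            site = subject.find("odm:SiteRef", NS)
            for event in subject.findall("odm:StudyEventData", NS):
                for audit in event.findall("odm:AuditRecord", NS):
                    rows.append({
                        "MetaDataVersionOID": clinical.get("MetaDataVersionOID"),
                        "AuditSubCategoryName": clinical.get(MDSOL + "AuditSubCategoryName"),
                        "SubjectKey": subject.get("SubjectKey"),
                        "SubjectName": subject.get(MDSOL + "SubjectName"),
                        "LocationOID": site.get("LocationOID") if site is not None else None,
                        "StudyEventRepeatKey": event.get("StudyEventRepeatKey"),
                        # findtext returns None instead of raising when the child is absent
                        "DateTimeStamp": audit.findtext("odm:DateTimeStamp", namespaces=NS),
                    })
    return rows


rows = audit_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

If a table is wanted, the resulting list of dicts can be passed to `pandas.DataFrame(rows)`.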

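I think you can use BeautifulSoup for parsing XML. Note that the "lxml" parser used here is an HTML parser and lowercases tag and attribute names, which is why the lookups below go through .lower():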

from bs4 import BeautifulSoup

temp = """<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3" 
        xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata" 
        CreationDateTime="2019-08-23T12:59:09" FileOID="3b2b4161-fad8-4239-9c83-03d0e62624dd" FileType="Transactional" ODMVersion="1.3">

    <ClinicalData MetaDataVersionOID="1772" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Activated">
        <SubjectData SubjectKey="7735fd9c-1792-457c-aa58-0ca26ecdc810" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-3">
            <SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
            <StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960580">
                <AuditRecord>
                    <UserRef UserOID="systemuser"/>
                    <LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
                    <DateTimeStamp>2019-07-10T07:56:54</DateTimeStamp>
                    <ReasonForChange>Update</ReasonForChange>
                    <SourceID>394263772</SourceID>
                </AuditRecord>
            </StudyEventData>
        </SubjectData>
    </ClinicalData>
</ODM>"""



temp = BeautifulSoup(temp, "lxml")
ClinicalData = temp.find('ClinicalData'.lower())
SubjectData = ClinicalData.find_all('SubjectData'.lower())
LocationOID = None
for i in SubjectData:
    SiteRef = i.find('SiteRef'.lower())
    LocationOID = SiteRef.attrs['locationoid']

print('LocationOID', LocationOID)

output:

LocationOID 0ACCSP3MAPPING1SITE1

@Justin I applied your suggestions and they worked, until I broke it.

Input:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3" xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata" CreationDateTime="2019-08-23T12:59:09" FileOID="3b2b4161-fad8-4239-9c83-03d0e62624dd" FileType="Transactional" ODMVersion="1.3">
    <ClinicalData MetaDataVersionOID="2965" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Entered">
        <SubjectData SubjectKey="481e4653-693c-4e15-8762-d8a66c0d2cf1" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-1">
            <SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
            <StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960564">
                <FormData FormOID="VS" FormRepeatKey="1" mdsol:DataPageId="15331229">
                    <ItemGroupData ItemGroupOID="VS" mdsol:RecordId="17928808">
                        <ItemData ItemOID="VS.WT" TransactionType="Upsert" Value="45">
                            <AuditRecord>
                                <UserRef UserOID="alscrave2"/>
                                <LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
                                <DateTimeStamp>2018-02-02T09:39:30</DateTimeStamp>
                                <ReasonForChange/>
                                <SourceID>122841525</SourceID>
                            </AuditRecord>
                            <MeasurementUnitRef MeasurementUnitOID="1761.Weight.1"/>
                        </ItemData>
                    </ItemGroupData>
                </FormData>
            </StudyEventData>
        </SubjectData>
    </ClinicalData>
    <ClinicalData MetaDataVersionOID="2965" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Entered">
        <SubjectData SubjectKey="481e4653-693c-4e15-8762-d8a66c0d2cf1" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-1">
            <SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
            <StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960564">
                <FormData FormOID="VS" FormRepeatKey="1" mdsol:DataPageId="15331229">
                    <ItemGroupData ItemGroupOID="VS" mdsol:RecordId="17928809">
                        <ItemData ItemOID="VS.WT" TransactionType="Upsert" Value="46">
                            <AuditRecord>
                                <UserRef UserOID="alscrave2"/>
                                <LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
                                <DateTimeStamp>2018-02-02T09:39:30</DateTimeStamp>
                                <ReasonForChange/>
                                <SourceID>122841525</SourceID>
                            </AuditRecord>
                            <MeasurementUnitRef MeasurementUnitOID="1761.Weight.1"/>
                        </ItemData>
                    </ItemGroupData>
                </FormData>
            </StudyEventData>
        </SubjectData>
    </ClinicalData>
</ODM>

Code:

import xml.etree.ElementTree as ET
import pandas as pd

def getvalueofnode(node):
    """ return node text or None """
    return node.text if node is not None else None

tree = ET.parse("data.xml")
ODM = tree.getroot()

xmlns = "{http://www.cdisc.org/ns/odm/v1.3}"
mdsol = "{http://www.mdsol.com/ns/odm/metadata}"

def data_reader():
    dfcols = ['CreationDateTime','StudyOID','MetaDataVersionOID','SubjectName','SUBJECTUUID','LocationOID','StudyEventOID',
             'StudyEventRepeatKey','FormOID','FormRepeatKey','DataPageId','ItemgroupOID','RecordId','var_name','Value',
             'DateTimeStamp','ASC_Name','Measurement_Unit','SourceID','UserOID','InstanceId']
    df_xml = pd.DataFrame(columns=dfcols)

    CreationDateTime = ODM.attrib.get('CreationDateTime')

    for ClinicalData in ODM:
        StudyOID = ClinicalData.attrib.get('StudyOID')
        MetaDataVersionOID = ClinicalData.attrib.get('MetaDataVersionOID')
        ASC_Name = ClinicalData.attrib.get('{0}AuditSubCategoryName'.format(mdsol))
        for SubjectData in ClinicalData:
            SubjectName = SubjectData.attrib.get('{0}SubjectName'.format(mdsol))
            SUBJECTUUID = SubjectData.attrib.get('SubjectKey')
            LocationOID = SubjectData.find('{0}SiteRef'.format(xmlns)).attrib.get('LocationOID')
            for StudyEventData in SubjectData:
                StudyEventOID = StudyEventData.attrib.get('StudyEventOID')
                StudyEventRepeatKey = StudyEventData.attrib.get('StudyEventRepeatKey')
                InstanceId = StudyEventData.attrib.get('{0}InstanceId'.format(mdsol))
                for FormData in StudyEventData:
                    FormOID = FormData.attrib.get('FormOID')
                    FormRepeatKey = FormData.attrib.get('FormRepeatKey')
                    DataPageId = FormData.attrib.get('{0}DataPageId'.format(mdsol))
                    for ItemGroupData in FormData:
                        ItemgroupOID = ItemGroupData.attrib.get('ItemgroupOID')
                        RecordId = ItemGroupData.attrib.get('{0}RecordId'.format(mdsol))
                        for ItemData in ItemGroupData:
                            var_name = ItemData.attrib.get('ItemOID')
                            Value = ItemData.attrib.get('Value')
                            Measurement_Unit = ItemData.find('MeasurementUnitRef'.format(xmlns)).attrib.get('MeasurementUnitOID')
                            for AuditRecord in ItemData:
                                DateTimeStamp = AuditRecord.find('{0}DateTimeStamp'.format(xmlns)).text;
                                SourceID = AuditRecord.find('{0}SourceID'.format(xmlns)).text; 
                                UserOID = ItemData.find('{0}UserRef'.format(xmlns)).attrib.get('UserOID')
                                df_xml = df_xml.append(
                                pd.Series([CreationDateTime,StudyOID,MetaDataVersionOID,SubjectName,
                                           SUBJECTUUID,LocationOID,StudyEventOID,
                                           StudyEventRepeatKey,FormOID,FormRepeatKey,DataPageId,ItemgroupOID,
                                           RecordId,var_name,Value,DateTimeStamp,ASC_Name,Measurement_Unit,
                                           SourceID,UserOID,InstanceId], index=dfcols),
                                        ignore_index=True)

    print(df_xml)
data_reader()

Issue: I am getting duplicate records, and the assignments to DateTimeStamp, SourceID, UserOID and Measurement_Unit raise runtime errors.
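Reading the code above, a few things would explain those symptoms. This is a reading of the posted code, not something tested against the real export:

- `'MeasurementUnitRef'.format(xmlns)` has no `{0}` placeholder, so the namespace is never prepended and `find` returns None.
- `for AuditRecord in ItemData` visits every child of ItemData, including MeasurementUnitRef, which has no DateTimeStamp or SourceID. That extra pass is also a likely source of the duplicate rows.
- UserRef is a child of AuditRecord, not of ItemData.
- The attribute is spelled `ItemGroupOID` in the XML, not `ItemgroupOID`.

A minimal sketch of the innermost levels, assuming the second input above. The `item_rows` helper, the `odm` prefix and the trimmed inline sample are illustrative names of mine. Rows are gathered in a plain list because `DataFrame.append` was removed in pandas 2.0.

```python
import xml.etree.ElementTree as ET

NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}
MDSOL = "{http://www.mdsol.com/ns/odm/metadata}"

# Trimmed copy of the second input, inlined so the sketch is self-contained.
SAMPLE = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
     xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata">
  <ClinicalData MetaDataVersionOID="2965">
    <SubjectData SubjectKey="481e4653-693c-4e15-8762-d8a66c0d2cf1">
      <StudyEventData StudyEventOID="FV">
        <FormData FormOID="VS">
          <ItemGroupData ItemGroupOID="VS" mdsol:RecordId="17928808">
            <ItemData ItemOID="VS.WT" Value="45">
              <AuditRecord>
                <UserRef UserOID="alscrave2"/>
                <DateTimeStamp>2018-02-02T09:39:30</DateTimeStamp>
                <SourceID>122841525</SourceID>
              </AuditRecord>
              <MeasurementUnitRef MeasurementUnitOID="1761.Weight.1"/>
            </ItemData>
          </ItemGroupData>
        </FormData>
      </StudyEventData>
    </SubjectData>
  </ClinicalData>
</ODM>"""


def item_rows(root):
    """Return one dict per AuditRecord found under each ItemData."""
    rows = []
    # ".//" searches all descendants, so the outer levels need no loops here.
    for group in root.findall(".//odm:ItemGroupData", NS):
        for item in group.findall("odm:ItemData", NS):
            unit = item.find("odm:MeasurementUnitRef", NS)
            # Select only AuditRecord children, not every child of ItemData.
            for audit in item.findall("odm:AuditRecord", NS):
                user = audit.find("odm:UserRef", NS)  # UserRef sits under AuditRecord
                rows.append({
                    "ItemGroupOID": group.get("ItemGroupOID"),
                    "RecordId": group.get(MDSOL + "RecordId"),
                    "var_name": item.get("ItemOID"),
                    "Value": item.get("Value"),
                    "Measurement_Unit": unit.get("MeasurementUnitOID") if unit is not None else None,
                    "DateTimeStamp": audit.findtext("odm:DateTimeStamp", namespaces=NS),
                    "SourceID": audit.findtext("odm:SourceID", namespaces=NS),
                    "UserOID": user.get("UserOID") if user is not None else None,
                })
    return rows


rows = item_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

The parent-level columns (StudyOID, SubjectName and so on) can be added the same way by looping over the outer elements with `findall` instead of using `.//`, and `pandas.DataFrame(rows)` builds the table in one step at the end.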

