
Extracting a tidy data frame from a highly nested XML file

I have a complex, multiply nested, XML file that I am trying to extract data from and convert into a data frame, for subsequent plotting and analysis etc. Solutions with either R or Python would be fine, but I've never worked with XML files and I'm struggling to understand how to extract the data I need (I'm reading up on XPath syntax, which is new to me).

I've tried the R packages XML, xml2, and xmltools, and I've also experimented with Python element trees. Most of the examples I've followed use much simpler XML files; I haven't figured out how to extend their logic to my own case, and have only ended up with a nonsensical mess.

The structure of the XML file is:

 ├── XMLFILE
 ├── DATASET
 └── GROUPDATA
  └── GROUP
    ├── METHODDATA
    ├── SAMPLELISTDATA
      ├── SAMPLE
        ├── USERDATA
        ├── COMPOUND
          ├── METHOD
          ├── USERDATA
          └── PEAK
            └── ISPEAK
        ├── COMPOUND
          ├── METHOD
          ├── USERDATA
          └── PEAK
            └── ISPEAK
      └── SAMPLE
        ├── USERDATA
        ├── COMPOUND
          ├── METHOD
          ├── USERDATA
          └── PEAK
            └── ISPEAK
        ├── COMPOUND
          ├── METHOD
          ├── USERDATA
          └── PEAK
            └── ISPEAK
    └── CALIBRATIONDATA
      ├── COMPOUND
        ├── RESPONSE
        └── CURVE
          └── RESPONSEFACTOR
      └── COMPOUND
        ├── RESPONSE
        └── CURVE
          ├── CALIBRATIONCURVE
          └── DETERMINATION

I only care about what's in the SAMPLELISTDATA section. I've shown only 2 SAMPLEs, with 2 COMPOUNDs in each, but the real file contains many of both. Every tag in the tree also has many attributes, which I need to extract data from.
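To illustrate restricting the search to SAMPLELISTDATA, here is a minimal sketch using Python's standard-library xml.etree.ElementTree. The document below is a tiny stand-in with the same nesting as the tree above; its attribute values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the real file: same nesting, invented attribute values.
XML_TEXT = """
<QUANDATASET>
  <GROUPDATA>
    <GROUP id="1">
      <SAMPLELISTDATA>
        <SAMPLE id="1" name="S1">
          <COMPOUND id="1" name="A"/>
          <COMPOUND id="14" name="B"/>
        </SAMPLE>
        <SAMPLE id="2" name="S2">
          <COMPOUND id="1" name="A"/>
          <COMPOUND id="14" name="B"/>
        </SAMPLE>
      </SAMPLELISTDATA>
      <CALIBRATIONDATA>
        <COMPOUND id="1" name="Calibration A"/>
      </CALIBRATIONDATA>
    </GROUP>
  </GROUPDATA>
</QUANDATASET>
"""

root = ET.fromstring(XML_TEXT)

# ".//COMPOUND" matches every COMPOUND, including those under CALIBRATIONDATA.
all_compounds = root.findall(".//COMPOUND")

# Anchoring the path at SAMPLELISTDATA keeps only the sample-level compounds.
sample_compounds = root.findall(".//SAMPLELISTDATA/SAMPLE/COMPOUND")

print(len(all_compounds), len(sample_compounds))
```

Because COMPOUND appears under both SAMPLELISTDATA and CALIBRATIONDATA, a bare ".//COMPOUND" search would mix the two kinds of node, which is one plausible source of a "nonsensical mess".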

The actual XML is huge, but here's a (somewhat) minimal example:

<QUANDATASET description="" version="1">
    <XMLFILE filename="C:\Masslynx  Projects\Polyphenols_Dev.PRO\quandata.xml" modifieddate="20 Dec 2021" modifiedtime="15:53:06"/>
    <DATASET filename="C:\Masslynx  Projects\Polyphenols_Dev.PRO\211220_MAA_Jack.qld" modifieddate="20 Dec 2021" modifiedtime="15:50:10" creationdate="20 Dec 2021" creationtime="14:18:02"/>
    <GROUPDATA count="1">
        <GROUP id="1" name="MAA_JACK">
            <METHODDATA id="1" filename="C:\Masslynx  Projects\Polyphenols_Dev.PRO\MethDB\MAA_Jack.mdb" modifieddate="20 Dec 2021" modifiedtime="14:04:55" creationdate="20 Dec 2021" creationtime="14:04:55"/>
            <SAMPLELISTDATA filename="C:\Masslynx  Projects\Polyphenols_Dev.PRO\SampleDB\MAA_211220.SPL" modifieddate="20 Dec 2021" modifiedtime="09:55:58" count="12">
                <SAMPLE id="1" groupid="1" name="MAA_211220_01" createdate="20-Dec-21" createtime="10:00:08" type="Analyte" desc="'Umbilicalis' laver filtrate 7D7" dilutionfac="0.0000000000" extractvolume="0.0000000000" initamount="0.0000000000" injectvolume="2.0000000000" job="MAA_211220" sampleid="" samplenumber="1" stdconc="0.0000000000" stockdilutionfac="0.0000000000" subjecttext="" subjecttime="0.0000000000" userdilutionfac="0.0000000000" vial="1:A,1" inletmethodname="C:\Masslynx  Projects\Polyphenols_Dev.PRO\ACQUDB\MAA_Dev_17" msmethodname="C:\Masslynx  Projects\Polyphenols_Dev.PRO\ACQUDB\MAAs SIR5.EXP" prerunmethodname="" postrunmethodname="" switchmethodname="" hplcmethodname="" tunemethodname="C:\Masslynx  Projects\Histamine_QDA_Dev.PRO\ACQUDB\Default.ipr" fractionlynxname="" instrument="ACQ-QDA#KAD3691" lab="" conditions="" submitter="" task="" user="" reinjections="0" text="'Umbilicalis' laver filtrate 7D7">
                    <COMPOUND id="1" sampleid="1" groupid="1" name="Palythine" type="" cas="" stdconc="0.0000000000">
                        <PEAK foundscan="514" foundrt="1.7100000381" foundrrt="0.0000000000" predrt="1.7500000000" predrrt="0.0000000000" area="89222.9220000000" height="1567686.0000000000" response="89222.9220000000" pkflags="MM!" analconc="0.0000000000" empc="0.0000000000" bsanalconc="0.0000000000" conccalc="NaN" modifieddate="20-Dec-21" modifiedtime="14:22:50" modifiedtext="" modifieduser="" peakmass="0.0000000000" startrt="1.6399999857" endrt="1.7532999516" startht="-10476.0000000000" endht="-10476.0000000000" absresponse="89222.9220000000" rrtref="0" quanratio="0.0000000000" quanratiopred="1.0000000000" quanratiowin="0.0000000000" ionratio="0.0000000000" ionratiopred="0.0000000000" ionratiowin="0" ionratioflag="0" chromnoise="11.0944900513" detectionthreshold="0.0000000000" detectionflag="0" quanthreshold="0.0000000000" quanflag="0" snlodflag="0" snloqflag="0" rrf="0.0000000000" chromtrace="318_322" peaks="0" pkwidth="3.0210000000" pksigma="6.3800000000" pkskew="-0.1190000000" pkkurt="-0.4500000000" heightdivarea="17.5704400266" baselinewidth="6.7979979515" peakquality="n/a" peakqualitydesc="" peakqualityref="N" replimflag="0" maxreplimflag="0" recovlimflag="0" matrixblankflag="0" solventblankflag="0" devflag="0" devflagmidconc="0" devflaglowconc="0" qcsignoiseflag="0" qcionratioflag="0" qcrettimeflag="0" qcpeakshapeflag="0" signoise="141303.1146768486" signoiseflag="0" cdflag="0" stddevflag="0" rtflag="0" peakasymmetry="0" peakfrontwidth="0.0700000003" peaktailwidth="0.0430000015" peakasymmetryvalue="0.6190000176" percrecovery="0.0000000000" symflag="" percsym="0.0000000000" belowrl="1" chromnoisehgt="0.0000000000" concdevperc="0.0000000000" lowerbound1="0.0000000000" lowerbound2="0.0000000000" lowerbound3="0.0000000000" lowerbound4="0.0000000000" mediumbound1="0.0000000000" mediumbound2="0.0000000000" mediumbound3="0.0000000000" mediumbound4="0.0000000000" upperbound1="0.0000000000" upperbound2="0.0000000000" upperbound3="0.0000000000" 
upperbound4="0.0000000000" nosolflag="0" peakmissing="0" peaksinc="0" toxconc1="0.0000000000" toxconc2="0.0000000000" toxconc3="0.0000000000" toxconc4="0.0000000000" toxfactor1="0.0000000000" toxfactor2="0.0000000000" toxfactor3="0.0000000000" toxfactor4="0.0000000000" toxlod1="0.0000000000" toxlod2="0.0000000000" toxlod3="0.0000000000" toxlod4="0.0000000000" toxloq1="0.0000000000" toxloq2="0.0000000000" toxloq3="0.0000000000" toxloq4="0.0000000000" userfactor="1.0000000000" userrf="0.0000000000" picsforward="0" picsreverse="0" iFIT="N/A" iFITnorm="N/A" iFITconfidence="N/A" foundmass="N/A" mDamasserror="N/A" ppmmasserror="N/A" iFitflag="0" iFitnormflag="0" iFitconfflag="0" mDaerrorflag="0" ppmerrorflag="0">
                            <ISPEAK area="" height="" foundrt="" absresponse=""/>
                        </PEAK>
                        <METHOD rref="0.0000000000" predrt="1.7500000000" predrrt="0.0000000000" userfactor="0.0000000000" userrf="0.0000000000" quantrace="318_322" secondarytrace="" useabsmasswin="1" chromasswinabs="1.0000000000" chromasswinppm="10.0000000000" stockconcfactor="0.0000000000" calibref="Palythine" replim="0.0000000000" replimflag="0" maxreplim="0.0000000000" maxreplimflag="0" minrecovlim="0.0000000000" maxrecovlim="100.0000000000" recovlimflag="0" maxstddev="0.0000000000" signoiseflag="0" mincoeffdet="0.5000000000" cdflag="0" minpeakwidth="0.0000000000" peakwidthtol="0.0000000000" peakwidthflag="0" blanklevel="0.0000000000" stddevflag="0" rtupper="0.0000000000" rtlower="0.0000000000" rtflag="0"/>
                        <USERDATA sampleid="1" groupid="1"/>
                    </COMPOUND>
                    <COMPOUND id="14" sampleid="1" groupid="1" name="Porphyra 334 SIR" type="" cas="" stdconc="0.0000000000">
                        <PEAK foundscan="161" foundrt="3.3292999268" foundrrt="0.0000000000" predrt="3.3099999428" predrrt="1.0000000000" area="2140861.2500000000" height="16134221.0000000000" response="2140861.2500000000" pkflags="bb" analconc="0.0000000000" empc="0.0000000000" bsanalconc="0.0000000000" conccalc="NaN" modifieddate="" modifiedtime="" modifiedtext="" modifieduser="" peakmass="0.0000000000" startrt="3.1303999424" endrt="3.7107000351" startht="3651.8000000000" endht="16670.4000000000" absresponse="2140861.2500000000" rrtref="0" quanratio="0.0000000000" quanratiopred="1.0000000000" quanratiowin="0.0000000000" ionratio="0.0000000000" ionratiopred="0.0000000000" ionratiowin="0" ionratioflag="0" chromnoise="334.2170715332" detectionthreshold="0.0000000000" detectionflag="0" quanthreshold="0.0000000000" quanflag="0" snlodflag="0" snloqflag="0" rrf="0.0000000000" chromtrace="347.1" peaks="0" pkwidth="7.7870000000" pksigma="3.2770000000" pkskew="0.6590000000" pkkurt="1.4860000000" heightdivarea="7.5363225898" baselinewidth="34.8180055618" peakquality="n/a" peakqualitydesc="" peakqualityref="N" replimflag="0" maxreplimflag="0" recovlimflag="0" matrixblankflag="0" solventblankflag="0" devflag="0" devflagmidconc="0" devflaglowconc="0" qcsignoiseflag="0" qcionratioflag="0" qcrettimeflag="0" qcpeakshapeflag="0" signoise="48274.6764729440" signoiseflag="0" cdflag="0" stddevflag="0" rtflag="0" peakasymmetry="0" peakfrontwidth="0.2000000030" peaktailwidth="0.3799999952" peakasymmetryvalue="1.8999999762" percrecovery="0.0000000000" symflag="" percsym="0.0000000000" belowrl="1" chromnoisehgt="6160.2280000000" concdevperc="0.0000000000" lowerbound1="0.0000000000" lowerbound2="0.0000000000" lowerbound3="0.0000000000" lowerbound4="0.0000000000" mediumbound1="0.0000000000" mediumbound2="0.0000000000" mediumbound3="0.0000000000" mediumbound4="0.0000000000" upperbound1="0.0000000000" upperbound2="0.0000000000" upperbound3="0.0000000000" upperbound4="0.0000000000" 
nosolflag="0" peakmissing="0" peaksinc="0" toxconc1="0.0000000000" toxconc2="0.0000000000" toxconc3="0.0000000000" toxconc4="0.0000000000" toxfactor1="0.0000000000" toxfactor2="0.0000000000" toxfactor3="0.0000000000" toxfactor4="0.0000000000" toxlod1="0.0000000000" toxlod2="0.0000000000" toxlod3="0.0000000000" toxlod4="0.0000000000" toxloq1="0.0000000000" toxloq2="0.0000000000" toxloq3="0.0000000000" toxloq4="0.0000000000" userfactor="1.0000000000" userrf="0.0000000000" picsforward="0" picsreverse="0" iFIT="N/A" iFITnorm="N/A" iFITconfidence="N/A" foundmass="N/A" mDamasserror="N/A" ppmmasserror="N/A" iFitflag="0" iFitnormflag="0" iFitconfflag="0" mDaerrorflag="0" ppmerrorflag="0">
                            <ISPEAK area="" height="" foundrt="" absresponse=""/>
                        </PEAK>
                        <METHOD rref="0.0000000000" predrt="3.3099999428" predrrt="1.0000000000" userfactor="0.0000000000" userrf="0.0000000000" quantrace="347.1" secondarytrace="" useabsmasswin="1" chromasswinabs="1.0000000000" chromasswinppm="10.0000000000" stockconcfactor="0.0000000000" calibref="Porphyra 334 SIR" replim="0.0000000000" replimflag="0" maxreplim="0.0000000000" maxreplimflag="0" minrecovlim="0.0000000000" maxrecovlim="100.0000000000" recovlimflag="0" maxstddev="0.0000000000" signoiseflag="0" mincoeffdet="0.5000000000" cdflag="0" minpeakwidth="0.0000000000" peakwidthtol="0.0000000000" peakwidthflag="0" blanklevel="0.0000000000" stddevflag="0" rtupper="0.0000000000" rtlower="0.0000000000" rtflag="0"/>
                        <USERDATA sampleid="1" groupid="1"/>
                    </COMPOUND>
                    <USERDATA sampleid="1" groupid="1"/>
                </SAMPLE>
                <SAMPLE id="2" groupid="1" name="MAA_211220_02" createdate="20-Dec-21" createtime="10:11:04" type="Analyte" desc="'Umbilicalis' laver filtrate 3D9" dilutionfac="0.0000000000" extractvolume="0.0000000000" initamount="0.0000000000" injectvolume="2.0000000000" job="MAA_211220" sampleid="" samplenumber="2" stdconc="0.0000000000" stockdilutionfac="0.0000000000" subjecttext="" subjecttime="0.0000000000" userdilutionfac="0.0000000000" vial="1:A,2" inletmethodname="C:\Masslynx  Projects\Polyphenols_Dev.PRO\ACQUDB\MAA_Dev_17" msmethodname="C:\Masslynx  Projects\Polyphenols_Dev.PRO\ACQUDB\MAAs SIR5.EXP" prerunmethodname="" postrunmethodname="" switchmethodname="" hplcmethodname="" tunemethodname="C:\Masslynx  Projects\Histamine_QDA_Dev.PRO\ACQUDB\Default.ipr" fractionlynxname="" instrument="ACQ-QDA#KAD3691" lab="" conditions="" submitter="" task="" user="" reinjections="0" text="'Umbilicalis' laver filtrate 3D9">
                    <COMPOUND id="1" sampleid="2" groupid="1" name="Palythine" type="" cas="" stdconc="0.0000000000">
                        <PEAK foundscan="517" foundrt="1.7200000286" foundrrt="0.0000000000" predrt="1.7500000000" predrrt="0.0000000000" area="69654.0080000000" height="1250121.0000000000" response="69654.0080000000" pkflags="MM!" analconc="0.0000000000" empc="0.0000000000" bsanalconc="0.0000000000" conccalc="NaN" modifieddate="20-Dec-21" modifiedtime="14:24:57" modifiedtext="" modifieduser="" peakmass="0.0000000000" startrt="1.6000000238" endrt="1.7599999905" startht="0.0000000000" endht="10847.0340000000" absresponse="69654.0080000000" rrtref="0" quanratio="0.0000000000" quanratiopred="1.0000000000" quanratiowin="0.0000000000" ionratio="0.0000000000" ionratiopred="0.0000000000" ionratiowin="0" ionratioflag="0" chromnoise="4.1693286896" detectionthreshold="0.0000000000" detectionflag="0" quanthreshold="0.0000000000" quanflag="0" snlodflag="0" snloqflag="0" rrf="0.0000000000" chromtrace="318_322" peaks="0" pkwidth="3.0090000000" pksigma="6.4940000000" pkskew="-0.4530000000" pkkurt="0.7820000000" heightdivarea="17.9475817099" baselinewidth="9.5999979973" peakquality="n/a" peakqualitydesc="" peakqualityref="N" replimflag="0" maxreplimflag="0" recovlimflag="0" matrixblankflag="0" solventblankflag="0" devflag="0" devflagmidconc="0" devflaglowconc="0" qcsignoiseflag="0" qcionratioflag="0" qcrettimeflag="0" qcpeakshapeflag="0" signoise="299837.4781816338" signoiseflag="0" cdflag="0" stddevflag="0" rtflag="0" peakasymmetry="0" peakfrontwidth="0.1199999973" peaktailwidth="0.0399999991" peakasymmetryvalue="0.3330000043" percrecovery="0.0000000000" symflag="" percsym="0.0000000000" belowrl="1" chromnoisehgt="0.0000000000" concdevperc="0.0000000000" lowerbound1="0.0000000000" lowerbound2="0.0000000000" lowerbound3="0.0000000000" lowerbound4="0.0000000000" mediumbound1="0.0000000000" mediumbound2="0.0000000000" mediumbound3="0.0000000000" mediumbound4="0.0000000000" upperbound1="0.0000000000" upperbound2="0.0000000000" upperbound3="0.0000000000" upperbound4="0.0000000000" 
nosolflag="0" peakmissing="0" peaksinc="0" toxconc1="0.0000000000" toxconc2="0.0000000000" toxconc3="0.0000000000" toxconc4="0.0000000000" toxfactor1="0.0000000000" toxfactor2="0.0000000000" toxfactor3="0.0000000000" toxfactor4="0.0000000000" toxlod1="0.0000000000" toxlod2="0.0000000000" toxlod3="0.0000000000" toxlod4="0.0000000000" toxloq1="0.0000000000" toxloq2="0.0000000000" toxloq3="0.0000000000" toxloq4="0.0000000000" userfactor="1.0000000000" userrf="0.0000000000" picsforward="0" picsreverse="0" iFIT="N/A" iFITnorm="N/A" iFITconfidence="N/A" foundmass="N/A" mDamasserror="N/A" ppmmasserror="N/A" iFitflag="0" iFitnormflag="0" iFitconfflag="0" mDaerrorflag="0" ppmerrorflag="0">
                            <ISPEAK area="" height="" foundrt="" absresponse=""/>
                        </PEAK>
                        <METHOD rref="0.0000000000" predrt="1.7500000000" predrrt="0.0000000000" userfactor="0.0000000000" userrf="0.0000000000" quantrace="318_322" secondarytrace="" useabsmasswin="1" chromasswinabs="1.0000000000" chromasswinppm="10.0000000000" stockconcfactor="0.0000000000" calibref="Palythine" replim="0.0000000000" replimflag="0" maxreplim="0.0000000000" maxreplimflag="0" minrecovlim="0.0000000000" maxrecovlim="100.0000000000" recovlimflag="0" maxstddev="0.0000000000" signoiseflag="0" mincoeffdet="0.5000000000" cdflag="0" minpeakwidth="0.0000000000" peakwidthtol="0.0000000000" peakwidthflag="0" blanklevel="0.0000000000" stddevflag="0" rtupper="0.0000000000" rtlower="0.0000000000" rtflag="0"/>
                        <USERDATA sampleid="2" groupid="1"/>
                    </COMPOUND>
                    <COMPOUND id="14" sampleid="2" groupid="1" name="Porphyra 334 SIR" type="" cas="" stdconc="0.0000000000">
                        <PEAK foundscan="162" foundrt="3.3459000587" foundrrt="0.0000000000" predrt="3.3099999428" predrrt="1.0000000000" area="1934833.8750000000" height="14881056.0000000000" response="1934833.8750000000" pkflags="bb" analconc="0.0000000000" empc="0.0000000000" bsanalconc="0.0000000000" conccalc="NaN" modifieddate="" modifiedtime="" modifiedtext="" modifieduser="" peakmass="0.0000000000" startrt="3.1800999641" endrt="3.7107000351" startht="5267.0000000000" endht="16324.8000000000" absresponse="1934833.8750000000" rrtref="0" quanratio="0.0000000000" quanratiopred="1.0000000000" quanratiowin="0.0000000000" ionratio="0.0000000000" ionratiopred="0.0000000000" ionratiowin="0" ionratioflag="0" chromnoise="208.7208557129" detectionthreshold="0.0000000000" detectionflag="0" quanthreshold="0.0000000000" quanflag="0" snlodflag="0" snloqflag="0" rrf="0.0000000000" chromtrace="347.1" peaks="0" pkwidth="7.5160000000" pksigma="3.2120000000" pkskew="0.6470000000" pkkurt="1.3920000000" heightdivarea="7.6911285213" baselinewidth="31.8360042572" peakquality="n/a" peakqualitydesc="" peakqualityref="N" replimflag="0" maxreplimflag="0" recovlimflag="0" matrixblankflag="0" solventblankflag="0" devflag="0" devflagmidconc="0" devflaglowconc="0" qcsignoiseflag="0" qcionratioflag="0" qcrettimeflag="0" qcpeakshapeflag="0" signoise="71296.4497446734" signoiseflag="0" cdflag="0" stddevflag="0" rtflag="0" peakasymmetry="0" peakfrontwidth="0.1669999957" peaktailwidth="0.3639999926" peakasymmetryvalue="2.1860001087" percrecovery="0.0000000000" symflag="" percsym="0.0000000000" belowrl="1" chromnoisehgt="5185.1130000000" concdevperc="0.0000000000" lowerbound1="0.0000000000" lowerbound2="0.0000000000" lowerbound3="0.0000000000" lowerbound4="0.0000000000" mediumbound1="0.0000000000" mediumbound2="0.0000000000" mediumbound3="0.0000000000" mediumbound4="0.0000000000" upperbound1="0.0000000000" upperbound2="0.0000000000" upperbound3="0.0000000000" upperbound4="0.0000000000" 
nosolflag="0" peakmissing="0" peaksinc="0" toxconc1="0.0000000000" toxconc2="0.0000000000" toxconc3="0.0000000000" toxconc4="0.0000000000" toxfactor1="0.0000000000" toxfactor2="0.0000000000" toxfactor3="0.0000000000" toxfactor4="0.0000000000" toxlod1="0.0000000000" toxlod2="0.0000000000" toxlod3="0.0000000000" toxlod4="0.0000000000" toxloq1="0.0000000000" toxloq2="0.0000000000" toxloq3="0.0000000000" toxloq4="0.0000000000" userfactor="1.0000000000" userrf="0.0000000000" picsforward="0" picsreverse="0" iFIT="N/A" iFITnorm="N/A" iFITconfidence="N/A" foundmass="N/A" mDamasserror="N/A" ppmmasserror="N/A" iFitflag="0" iFitnormflag="0" iFitconfflag="0" mDaerrorflag="0" ppmerrorflag="0">
                            <ISPEAK area="" height="" foundrt="" absresponse=""/>
                        </PEAK>
                        <METHOD rref="0.0000000000" predrt="3.3099999428" predrrt="1.0000000000" userfactor="0.0000000000" userrf="0.0000000000" quantrace="347.1" secondarytrace="" useabsmasswin="1" chromasswinabs="1.0000000000" chromasswinppm="10.0000000000" stockconcfactor="0.0000000000" calibref="Porphyra 334 SIR" replim="0.0000000000" replimflag="0" maxreplim="0.0000000000" maxreplimflag="0" minrecovlim="0.0000000000" maxrecovlim="100.0000000000" recovlimflag="0" maxstddev="0.0000000000" signoiseflag="0" mincoeffdet="0.5000000000" cdflag="0" minpeakwidth="0.0000000000" peakwidthtol="0.0000000000" peakwidthflag="0" blanklevel="0.0000000000" stddevflag="0" rtupper="0.0000000000" rtlower="0.0000000000" rtflag="0"/>
                        <USERDATA sampleid="2" groupid="1"/>
                    </COMPOUND>
                    <USERDATA sampleid="2" groupid="1"/>
                </SAMPLE>
            </SAMPLELISTDATA>
            <CALIBRATIONDATA filename="C:\Masslynx  Projects\Caffeine.PRO\CurveDB\Meth1.cdb" modifieddate="25 Sep 2015" modifiedtime="00:20:14" count="2">
                <COMPOUND id="1" name="Compound A ( 430.5 )">
                    <RESPONSE type="External Std" ref="" rah="Area"/>
                    <CURVE type="RF" origin="" weighting="" axistrans="">
                        <RESPONSEFACTOR cc="15552.5556000000" stddev="2208.2674143620" percrelsd="0.1319874310"/>
                    </CURVE>
                </COMPOUND>
                <COMPOUND id="2" name="Compound B ( 458.5 )">
                    <RESPONSE type="Internal Std" ref="1" rah="Area * ( IS Conc. / IS Area )"/>
                    <CURVE type="Linear" origin="Exclude" weighting="1/x" axistrans="None">
                        <CALIBRATIONCURVE curve="0.012594 * x + 0.005516"/>
                        <DETERMINATION rsquared="0.9741537568"/>
                    </CURVE>
                </COMPOUND>
            </CALIBRATIONDATA>
        </GROUP>
    </GROUPDATA>
</QUANDATASET>

What I'm trying to get to is a single data frame (in either R or Python/Pandas) where each row holds all of the data (attributes) associated with one SAMPLE/COMPOUND pair. For instance, my example above has 2 samples with 2 compounds each, so it should give 4 rows, with a great many columns covering all of the attributes of the SAMPLE, the COMPOUND, and their child nodes.
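One possible shape for this, sketched in Python with the standard library only: walk every SAMPLE, then every COMPOUND inside it, and merge their attributes into one dictionary per pair. The function name and the `sample_`, `compound_` and `peak_` column prefixes are my own illustrative choices, and the XML is a heavily trimmed version of the example above.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the example: two samples, two compounds each.
XML_TEXT = """
<QUANDATASET>
  <GROUPDATA>
    <GROUP id="1">
      <SAMPLELISTDATA>
        <SAMPLE id="1" name="MAA_211220_01">
          <COMPOUND id="1" name="Palythine">
            <PEAK foundscan="514" area="89222.9220000000"/>
          </COMPOUND>
          <COMPOUND id="14" name="Porphyra 334 SIR">
            <PEAK foundscan="161" area="2140861.2500000000"/>
          </COMPOUND>
        </SAMPLE>
        <SAMPLE id="2" name="MAA_211220_02">
          <COMPOUND id="1" name="Palythine">
            <PEAK foundscan="517" area="69654.0080000000"/>
          </COMPOUND>
          <COMPOUND id="14" name="Porphyra 334 SIR">
            <PEAK foundscan="162" area="1934833.8750000000"/>
          </COMPOUND>
        </SAMPLE>
      </SAMPLELISTDATA>
    </GROUP>
  </GROUPDATA>
</QUANDATASET>
"""


def sample_compound_rows(root):
    """Return one dict per SAMPLE/COMPOUND pair (hypothetical helper)."""
    rows = []
    for sample in root.iterfind(".//SAMPLELISTDATA/SAMPLE"):
        # Prefix the keys so SAMPLE and COMPOUND attributes cannot collide.
        sample_cols = {f"sample_{k}": v for k, v in sample.attrib.items()}
        for compound in sample.iterfind("COMPOUND"):
            row = dict(sample_cols)
            row.update({f"compound_{k}": v for k, v in compound.attrib.items()})
            peak = compound.find("PEAK")
            if peak is not None:
                row.update({f"peak_{k}": v for k, v in peak.attrib.items()})
            rows.append(row)
    return rows


rows = sample_compound_rows(ET.fromstring(XML_TEXT))
for row in rows:
    print(row["sample_name"], row["compound_name"], row["peak_foundscan"])
```

A list of dictionaries like `rows` can be passed straight to `pandas.DataFrame(rows)` if pandas is available. All values arrive as strings, so numeric columns would still need converting afterwards.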

A list of data frames, one for each sample, would also work, but then the sample names would need to be associated to each data frame in that list, so I think the one big data frame might be easier.
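The two layouts are interchangeable as long as each row carries its sample name. A small Python sketch, assuming flat rows keyed by a hypothetical `sample_name` column (the values are taken from the example above):

```python
from collections import defaultdict

# Hypothetical flat rows, one per SAMPLE/COMPOUND pair.
rows = [
    {"sample_name": "MAA_211220_01", "compound_name": "Palythine", "peak_foundscan": "514"},
    {"sample_name": "MAA_211220_01", "compound_name": "Porphyra 334 SIR", "peak_foundscan": "161"},
    {"sample_name": "MAA_211220_02", "compound_name": "Palythine", "peak_foundscan": "517"},
    {"sample_name": "MAA_211220_02", "compound_name": "Porphyra 334 SIR", "peak_foundscan": "162"},
]


def split_by_sample(flat_rows, key="sample_name"):
    """Group flat rows into {sample name: [rows]} so names stay attached."""
    groups = defaultdict(list)
    for row in flat_rows:
        groups[row[key]].append(row)
    return dict(groups)


per_sample = split_by_sample(rows)
print({name: len(group) for name, group in per_sample.items()})
```

Since the split is this cheap, building the one big table first and grouping later only when needed seems the simpler route.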

Thanks so much for any help/insights/tips/advice.

Here's what I managed to do, based on this previous question I've asked.

## dplyr provides %>%, tibble() and bind_rows()
library(dplyr)

## Import the data
# Here, test.xml contains the XML you provided

data <- xml2::read_xml("Z:/temp/test.xml")

## Isolate SAMPLELISTDATA as a list:
## QUANDATASET -> GROUPDATA (3rd child) -> GROUP -> SAMPLELISTDATA (2nd child)
data_list2 <- xml2::as_list(data)[[1]][[3]][[1]][[2]]

## Create an empty tibble to accumulate the output rows
output_desired <- tibble()


## Function to get the attributes of one node
fusion_et_gestion <- function(y) {

  ## We choose the attributes we want to keep
  foundscan <- attr(y, "foundscan")
  area <- attr(y, "area")
  height <- attr(y, "height")

  ## Output as tibble; nodes without these attributes give zero rows
  tibble(foundscan = foundscan,
         area = area,
         height = height)
}


## Using for loops here, but map(1:2, ...) would be faster for your real data.
## j indexes the SAMPLEs and i the COMPOUNDs (2 of each in this example).
for (j in 1:2) {
  for (i in 1:2) {
    test <- data_list2[[j]][[i]] %>%
      purrr::map_dfr(fusion_et_gestion)

    output_desired <- bind_rows(output_desired, test)
  }
}


## Output:

# A tibble: 4 x 3
  foundscan area               height             
  <chr>     <chr>              <chr>              
1 514       89222.9220000000   1567686.0000000000 
2 161       2140861.2500000000 16134221.0000000000
3 517       69654.0080000000   1250121.0000000000 
4 162       1934833.8750000000 14881056.0000000000

However:

  1. If you want to keep the attributes of the ISPEAK node, you need to add a line inside the fusion_et_gestion function and specify the level of y, i.e. y[[1]]. Be careful that the name you give to this attribute is not reused later inside the function.
  2. I didn't find a way to include all the attributes without typing them one by one. As there are 196 of them, one idea would be to add another function inside fusion_et_gestion that gets all the attribute names and their values. That could be done with map(list_of_attribute, function_to_get_values).
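Both points can be sketched in Python, which the question also accepts. This is only an illustration with the standard library, not a translation of the R code above: it collects every attribute of a node and its descendants without naming them, and prefixes each key with the tag path so that PEAK's `area` and ISPEAK's `area` cannot overwrite each other. The function name and the dotted key format are my own assumptions, and the XML is a trimmed COMPOUND from the example.

```python
import xml.etree.ElementTree as ET

# One COMPOUND from the example, with most attributes trimmed away.
COMPOUND_XML = """
<COMPOUND id="1" name="Palythine">
  <PEAK foundscan="514" area="89222.9220000000" height="1567686.0000000000">
    <ISPEAK area="" height="" foundrt="" absresponse=""/>
  </PEAK>
  <METHOD predrt="1.7500000000" quantrace="318_322"/>
  <USERDATA sampleid="1" groupid="1"/>
</COMPOUND>
"""


def flatten_attributes(element, prefix=""):
    """Collect the attributes of element and all its descendants in one dict.

    Keys look like "COMPOUND.PEAK.ISPEAK.area" (hypothetical naming scheme).
    """
    path = f"{prefix}{element.tag}"
    flat = {f"{path}.{name}": value for name, value in element.attrib.items()}
    for child in element:
        flat.update(flatten_attributes(child, prefix=f"{path}."))
    return flat


row = flatten_attributes(ET.fromstring(COMPOUND_XML))
for key in sorted(row):
    print(key, "=", repr(row[key]))
```

One caveat: siblings that share a tag produce identical keys, so the later one overwrites the earlier. That is why this sketch is meant to be called once per COMPOUND, not on a whole SAMPLE.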

To get a list of the attributes, you can do:

data %>% 
    xml2::xml_find_all("//*") %>% 
    purrr::map(~names(xml2::xml_attrs(.))) %>%
    unlist() %>% 
    unique()
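The pipeline above pools the names from every node into one vector. For comparison, here is a rough Python sketch (standard library only, trimmed example data, hypothetical function name) that keeps the names grouped by tag, which makes it easier to spot names shared between elements, such as `area` on both PEAK and ISPEAK:

```python
import xml.etree.ElementTree as ET

# Trimmed SAMPLE from the example: two compounds, a few attributes each.
XML_TEXT = """
<SAMPLE id="1" name="MAA_211220_01">
  <COMPOUND id="1" name="Palythine">
    <PEAK foundscan="514" area="89222.9220000000">
      <ISPEAK area="" height=""/>
    </PEAK>
  </COMPOUND>
  <COMPOUND id="14" name="Porphyra 334 SIR">
    <PEAK foundscan="161" area="2140861.2500000000">
      <ISPEAK area="" height=""/>
    </PEAK>
  </COMPOUND>
</SAMPLE>
"""


def attribute_names_by_tag(root):
    """Map each tag to the distinct attribute names seen on it."""
    names = {}
    for element in root.iter():  # iter() visits root and all descendants
        seen = names.setdefault(element.tag, [])
        for name in element.attrib:
            if name not in seen:
                seen.append(name)
    return names


names = attribute_names_by_tag(ET.fromstring(XML_TEXT))
for tag, attrs in names.items():
    print(tag, attrs)
```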
