
Extracting a tidy data frame from a highly nested XML file

I have a complex, multiply nested XML file that I am trying to extract data from and convert into a data frame for subsequent plotting and analysis. Solutions in either R or Python would be fine, but I have never worked with XML files before, and I am struggling to understand how to extract the data I need (I am reading up on XPath syntax, which is new to me).

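I have tried the R packages XML, xml2 and xmltools, and I have also tried Python's ElementTree. Most of the examples I worked through use much simpler XML files, and I have not figured out how to extend their logic to my own case; my attempts so far have produced a mess.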

The structure of the XML file is:

QUANDATASET
 ├── XMLFILE
 ├── DATASET
 └── GROUPDATA
  └── GROUP
    ├── METHODDATA
    ├── SAMPLELISTDATA
      ├── SAMPLE
        ├── USERDATA
        ├── COMPOUND
          ├── METHOD
          ├── USERDATA
          └── PEAK
            └── ISPEAK
        ├── COMPOUND
          ├── METHOD
          ├── USERDATA
          └── PEAK
            └── ISPEAK
      └── SAMPLE
        ├── USERDATA
        ├── COMPOUND
          ├── METHOD
          ├── USERDATA
          └── PEAK
            └── ISPEAK
        ├── COMPOUND
          ├── METHOD
          ├── USERDATA
          └── PEAK
            └── ISPEAK
    └── CALIBRATIONDATA
      ├── COMPOUND
        ├── RESPONSE
        └── CURVE
          └── RESPONSEFACTOR
      └── COMPOUND
        ├── RESPONSE
        └── CURVE
          ├── CALIBRATIONCURVE
          └── DETERMINATION

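I only care about what is inside the SAMPLELISTDATA section. I have shown only 2 samples with 2 compounds each, but the real file has many of both. Every tag in the tree also carries many attributes, and those attributes are the data I need to extract.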
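As a rough illustration of restricting the search to SAMPLELISTDATA, here is a minimal Python sketch using the standard-library `xml.etree.ElementTree`. The inline XML is a hypothetical, cut-down stand-in for the real file (same nesting, only a couple of attributes per tag), and the path expression assumes the nesting shown in the tree above.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the real file: same nesting, far fewer attributes.
XML = """
<QUANDATASET>
  <GROUPDATA>
    <GROUP id="1">
      <SAMPLELISTDATA>
        <SAMPLE id="1" name="MAA_211220_01">
          <COMPOUND id="1" name="Palythine"/>
          <COMPOUND id="14" name="Porphyra 334 SIR"/>
        </SAMPLE>
        <SAMPLE id="2" name="MAA_211220_02">
          <COMPOUND id="1" name="Palythine"/>
          <COMPOUND id="14" name="Porphyra 334 SIR"/>
        </SAMPLE>
      </SAMPLELISTDATA>
      <CALIBRATIONDATA>
        <COMPOUND id="1" name="Compound A ( 430.5 )"/>
      </CALIBRATIONDATA>
    </GROUP>
  </GROUPDATA>
</QUANDATASET>
"""

root = ET.fromstring(XML.strip())

# ".//SAMPLELISTDATA/SAMPLE" matches SAMPLE elements directly under any
# SAMPLELISTDATA, so COMPOUND nodes under CALIBRATIONDATA are never visited.
samples = root.findall(".//SAMPLELISTDATA/SAMPLE")

# Map each sample name to the names of the compounds measured in it.
compounds_per_sample = {
    sample.get("name"): [c.get("name") for c in sample.findall("COMPOUND")]
    for sample in samples
}
print(compounds_per_sample)
```

A bare `.//COMPOUND` search would also return the calibration compounds, which is why the path is anchored at SAMPLELISTDATA.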

The actual XML is huge, but here is a (somewhat) minimal example:

<QUANDATASET description="" version="1">
    <XMLFILE filename="C:\Masslynx  Projects\Polyphenols_Dev.PRO\quandata.xml" modifieddate="20 Dec 2021" modifiedtime="15:53:06"/>
    <DATASET filename="C:\Masslynx  Projects\Polyphenols_Dev.PRO\211220_MAA_Jack.qld" modifieddate="20 Dec 2021" modifiedtime="15:50:10" creationdate="20 Dec 2021" creationtime="14:18:02"/>
    <GROUPDATA count="1">
        <GROUP id="1" name="MAA_JACK">
            <METHODDATA id="1" filename="C:\Masslynx  Projects\Polyphenols_Dev.PRO\MethDB\MAA_Jack.mdb" modifieddate="20 Dec 2021" modifiedtime="14:04:55" creationdate="20 Dec 2021" creationtime="14:04:55"/>
            <SAMPLELISTDATA filename="C:\Masslynx  Projects\Polyphenols_Dev.PRO\SampleDB\MAA_211220.SPL" modifieddate="20 Dec 2021" modifiedtime="09:55:58" count="12">
                <SAMPLE id="1" groupid="1" name="MAA_211220_01" createdate="20-Dec-21" createtime="10:00:08" type="Analyte" desc="'Umbilicalis' laver filtrate 7D7" dilutionfac="0.0000000000" extractvolume="0.0000000000" initamount="0.0000000000" injectvolume="2.0000000000" job="MAA_211220" sampleid="" samplenumber="1" stdconc="0.0000000000" stockdilutionfac="0.0000000000" subjecttext="" subjecttime="0.0000000000" userdilutionfac="0.0000000000" vial="1:A,1" inletmethodname="C:\Masslynx  Projects\Polyphenols_Dev.PRO\ACQUDB\MAA_Dev_17" msmethodname="C:\Masslynx  Projects\Polyphenols_Dev.PRO\ACQUDB\MAAs SIR5.EXP" prerunmethodname="" postrunmethodname="" switchmethodname="" hplcmethodname="" tunemethodname="C:\Masslynx  Projects\Histamine_QDA_Dev.PRO\ACQUDB\Default.ipr" fractionlynxname="" instrument="ACQ-QDA#KAD3691" lab="" conditions="" submitter="" task="" user="" reinjections="0" text="'Umbilicalis' laver filtrate 7D7">
                    <COMPOUND id="1" sampleid="1" groupid="1" name="Palythine" type="" cas="" stdconc="0.0000000000">
                        <PEAK foundscan="514" foundrt="1.7100000381" foundrrt="0.0000000000" predrt="1.7500000000" predrrt="0.0000000000" area="89222.9220000000" height="1567686.0000000000" response="89222.9220000000" pkflags="MM!" analconc="0.0000000000" empc="0.0000000000" bsanalconc="0.0000000000" conccalc="NaN" modifieddate="20-Dec-21" modifiedtime="14:22:50" modifiedtext="" modifieduser="" peakmass="0.0000000000" startrt="1.6399999857" endrt="1.7532999516" startht="-10476.0000000000" endht="-10476.0000000000" absresponse="89222.9220000000" rrtref="0" quanratio="0.0000000000" quanratiopred="1.0000000000" quanratiowin="0.0000000000" ionratio="0.0000000000" ionratiopred="0.0000000000" ionratiowin="0" ionratioflag="0" chromnoise="11.0944900513" detectionthreshold="0.0000000000" detectionflag="0" quanthreshold="0.0000000000" quanflag="0" snlodflag="0" snloqflag="0" rrf="0.0000000000" chromtrace="318_322" peaks="0" pkwidth="3.0210000000" pksigma="6.3800000000" pkskew="-0.1190000000" pkkurt="-0.4500000000" heightdivarea="17.5704400266" baselinewidth="6.7979979515" peakquality="n/a" peakqualitydesc="" peakqualityref="N" replimflag="0" maxreplimflag="0" recovlimflag="0" matrixblankflag="0" solventblankflag="0" devflag="0" devflagmidconc="0" devflaglowconc="0" qcsignoiseflag="0" qcionratioflag="0" qcrettimeflag="0" qcpeakshapeflag="0" signoise="141303.1146768486" signoiseflag="0" cdflag="0" stddevflag="0" rtflag="0" peakasymmetry="0" peakfrontwidth="0.0700000003" peaktailwidth="0.0430000015" peakasymmetryvalue="0.6190000176" percrecovery="0.0000000000" symflag="" percsym="0.0000000000" belowrl="1" chromnoisehgt="0.0000000000" concdevperc="0.0000000000" lowerbound1="0.0000000000" lowerbound2="0.0000000000" lowerbound3="0.0000000000" lowerbound4="0.0000000000" mediumbound1="0.0000000000" mediumbound2="0.0000000000" mediumbound3="0.0000000000" mediumbound4="0.0000000000" upperbound1="0.0000000000" upperbound2="0.0000000000" upperbound3="0.0000000000" 
upperbound4="0.0000000000" nosolflag="0" peakmissing="0" peaksinc="0" toxconc1="0.0000000000" toxconc2="0.0000000000" toxconc3="0.0000000000" toxconc4="0.0000000000" toxfactor1="0.0000000000" toxfactor2="0.0000000000" toxfactor3="0.0000000000" toxfactor4="0.0000000000" toxlod1="0.0000000000" toxlod2="0.0000000000" toxlod3="0.0000000000" toxlod4="0.0000000000" toxloq1="0.0000000000" toxloq2="0.0000000000" toxloq3="0.0000000000" toxloq4="0.0000000000" userfactor="1.0000000000" userrf="0.0000000000" picsforward="0" picsreverse="0" iFIT="N/A" iFITnorm="N/A" iFITconfidence="N/A" foundmass="N/A" mDamasserror="N/A" ppmmasserror="N/A" iFitflag="0" iFitnormflag="0" iFitconfflag="0" mDaerrorflag="0" ppmerrorflag="0">
                            <ISPEAK area="" height="" foundrt="" absresponse=""/>
                        </PEAK>
                        <METHOD rref="0.0000000000" predrt="1.7500000000" predrrt="0.0000000000" userfactor="0.0000000000" userrf="0.0000000000" quantrace="318_322" secondarytrace="" useabsmasswin="1" chromasswinabs="1.0000000000" chromasswinppm="10.0000000000" stockconcfactor="0.0000000000" calibref="Palythine" replim="0.0000000000" replimflag="0" maxreplim="0.0000000000" maxreplimflag="0" minrecovlim="0.0000000000" maxrecovlim="100.0000000000" recovlimflag="0" maxstddev="0.0000000000" signoiseflag="0" mincoeffdet="0.5000000000" cdflag="0" minpeakwidth="0.0000000000" peakwidthtol="0.0000000000" peakwidthflag="0" blanklevel="0.0000000000" stddevflag="0" rtupper="0.0000000000" rtlower="0.0000000000" rtflag="0"/>
                        <USERDATA sampleid="1" groupid="1"/>
                    </COMPOUND>
                    <COMPOUND id="14" sampleid="1" groupid="1" name="Porphyra 334 SIR" type="" cas="" stdconc="0.0000000000">
                        <PEAK foundscan="161" foundrt="3.3292999268" foundrrt="0.0000000000" predrt="3.3099999428" predrrt="1.0000000000" area="2140861.2500000000" height="16134221.0000000000" response="2140861.2500000000" pkflags="bb" analconc="0.0000000000" empc="0.0000000000" bsanalconc="0.0000000000" conccalc="NaN" modifieddate="" modifiedtime="" modifiedtext="" modifieduser="" peakmass="0.0000000000" startrt="3.1303999424" endrt="3.7107000351" startht="3651.8000000000" endht="16670.4000000000" absresponse="2140861.2500000000" rrtref="0" quanratio="0.0000000000" quanratiopred="1.0000000000" quanratiowin="0.0000000000" ionratio="0.0000000000" ionratiopred="0.0000000000" ionratiowin="0" ionratioflag="0" chromnoise="334.2170715332" detectionthreshold="0.0000000000" detectionflag="0" quanthreshold="0.0000000000" quanflag="0" snlodflag="0" snloqflag="0" rrf="0.0000000000" chromtrace="347.1" peaks="0" pkwidth="7.7870000000" pksigma="3.2770000000" pkskew="0.6590000000" pkkurt="1.4860000000" heightdivarea="7.5363225898" baselinewidth="34.8180055618" peakquality="n/a" peakqualitydesc="" peakqualityref="N" replimflag="0" maxreplimflag="0" recovlimflag="0" matrixblankflag="0" solventblankflag="0" devflag="0" devflagmidconc="0" devflaglowconc="0" qcsignoiseflag="0" qcionratioflag="0" qcrettimeflag="0" qcpeakshapeflag="0" signoise="48274.6764729440" signoiseflag="0" cdflag="0" stddevflag="0" rtflag="0" peakasymmetry="0" peakfrontwidth="0.2000000030" peaktailwidth="0.3799999952" peakasymmetryvalue="1.8999999762" percrecovery="0.0000000000" symflag="" percsym="0.0000000000" belowrl="1" chromnoisehgt="6160.2280000000" concdevperc="0.0000000000" lowerbound1="0.0000000000" lowerbound2="0.0000000000" lowerbound3="0.0000000000" lowerbound4="0.0000000000" mediumbound1="0.0000000000" mediumbound2="0.0000000000" mediumbound3="0.0000000000" mediumbound4="0.0000000000" upperbound1="0.0000000000" upperbound2="0.0000000000" upperbound3="0.0000000000" upperbound4="0.0000000000" 
nosolflag="0" peakmissing="0" peaksinc="0" toxconc1="0.0000000000" toxconc2="0.0000000000" toxconc3="0.0000000000" toxconc4="0.0000000000" toxfactor1="0.0000000000" toxfactor2="0.0000000000" toxfactor3="0.0000000000" toxfactor4="0.0000000000" toxlod1="0.0000000000" toxlod2="0.0000000000" toxlod3="0.0000000000" toxlod4="0.0000000000" toxloq1="0.0000000000" toxloq2="0.0000000000" toxloq3="0.0000000000" toxloq4="0.0000000000" userfactor="1.0000000000" userrf="0.0000000000" picsforward="0" picsreverse="0" iFIT="N/A" iFITnorm="N/A" iFITconfidence="N/A" foundmass="N/A" mDamasserror="N/A" ppmmasserror="N/A" iFitflag="0" iFitnormflag="0" iFitconfflag="0" mDaerrorflag="0" ppmerrorflag="0">
                            <ISPEAK area="" height="" foundrt="" absresponse=""/>
                        </PEAK>
                        <METHOD rref="0.0000000000" predrt="3.3099999428" predrrt="1.0000000000" userfactor="0.0000000000" userrf="0.0000000000" quantrace="347.1" secondarytrace="" useabsmasswin="1" chromasswinabs="1.0000000000" chromasswinppm="10.0000000000" stockconcfactor="0.0000000000" calibref="Porphyra 334 SIR" replim="0.0000000000" replimflag="0" maxreplim="0.0000000000" maxreplimflag="0" minrecovlim="0.0000000000" maxrecovlim="100.0000000000" recovlimflag="0" maxstddev="0.0000000000" signoiseflag="0" mincoeffdet="0.5000000000" cdflag="0" minpeakwidth="0.0000000000" peakwidthtol="0.0000000000" peakwidthflag="0" blanklevel="0.0000000000" stddevflag="0" rtupper="0.0000000000" rtlower="0.0000000000" rtflag="0"/>
                        <USERDATA sampleid="1" groupid="1"/>
                    </COMPOUND>
                    <USERDATA sampleid="1" groupid="1"/>
                </SAMPLE>
                <SAMPLE id="2" groupid="1" name="MAA_211220_02" createdate="20-Dec-21" createtime="10:11:04" type="Analyte" desc="'Umbilicalis' laver filtrate 3D9" dilutionfac="0.0000000000" extractvolume="0.0000000000" initamount="0.0000000000" injectvolume="2.0000000000" job="MAA_211220" sampleid="" samplenumber="2" stdconc="0.0000000000" stockdilutionfac="0.0000000000" subjecttext="" subjecttime="0.0000000000" userdilutionfac="0.0000000000" vial="1:A,2" inletmethodname="C:\Masslynx  Projects\Polyphenols_Dev.PRO\ACQUDB\MAA_Dev_17" msmethodname="C:\Masslynx  Projects\Polyphenols_Dev.PRO\ACQUDB\MAAs SIR5.EXP" prerunmethodname="" postrunmethodname="" switchmethodname="" hplcmethodname="" tunemethodname="C:\Masslynx  Projects\Histamine_QDA_Dev.PRO\ACQUDB\Default.ipr" fractionlynxname="" instrument="ACQ-QDA#KAD3691" lab="" conditions="" submitter="" task="" user="" reinjections="0" text="'Umbilicalis' laver filtrate 3D9">
                    <COMPOUND id="1" sampleid="2" groupid="1" name="Palythine" type="" cas="" stdconc="0.0000000000">
                        <PEAK foundscan="517" foundrt="1.7200000286" foundrrt="0.0000000000" predrt="1.7500000000" predrrt="0.0000000000" area="69654.0080000000" height="1250121.0000000000" response="69654.0080000000" pkflags="MM!" analconc="0.0000000000" empc="0.0000000000" bsanalconc="0.0000000000" conccalc="NaN" modifieddate="20-Dec-21" modifiedtime="14:24:57" modifiedtext="" modifieduser="" peakmass="0.0000000000" startrt="1.6000000238" endrt="1.7599999905" startht="0.0000000000" endht="10847.0340000000" absresponse="69654.0080000000" rrtref="0" quanratio="0.0000000000" quanratiopred="1.0000000000" quanratiowin="0.0000000000" ionratio="0.0000000000" ionratiopred="0.0000000000" ionratiowin="0" ionratioflag="0" chromnoise="4.1693286896" detectionthreshold="0.0000000000" detectionflag="0" quanthreshold="0.0000000000" quanflag="0" snlodflag="0" snloqflag="0" rrf="0.0000000000" chromtrace="318_322" peaks="0" pkwidth="3.0090000000" pksigma="6.4940000000" pkskew="-0.4530000000" pkkurt="0.7820000000" heightdivarea="17.9475817099" baselinewidth="9.5999979973" peakquality="n/a" peakqualitydesc="" peakqualityref="N" replimflag="0" maxreplimflag="0" recovlimflag="0" matrixblankflag="0" solventblankflag="0" devflag="0" devflagmidconc="0" devflaglowconc="0" qcsignoiseflag="0" qcionratioflag="0" qcrettimeflag="0" qcpeakshapeflag="0" signoise="299837.4781816338" signoiseflag="0" cdflag="0" stddevflag="0" rtflag="0" peakasymmetry="0" peakfrontwidth="0.1199999973" peaktailwidth="0.0399999991" peakasymmetryvalue="0.3330000043" percrecovery="0.0000000000" symflag="" percsym="0.0000000000" belowrl="1" chromnoisehgt="0.0000000000" concdevperc="0.0000000000" lowerbound1="0.0000000000" lowerbound2="0.0000000000" lowerbound3="0.0000000000" lowerbound4="0.0000000000" mediumbound1="0.0000000000" mediumbound2="0.0000000000" mediumbound3="0.0000000000" mediumbound4="0.0000000000" upperbound1="0.0000000000" upperbound2="0.0000000000" upperbound3="0.0000000000" upperbound4="0.0000000000" 
nosolflag="0" peakmissing="0" peaksinc="0" toxconc1="0.0000000000" toxconc2="0.0000000000" toxconc3="0.0000000000" toxconc4="0.0000000000" toxfactor1="0.0000000000" toxfactor2="0.0000000000" toxfactor3="0.0000000000" toxfactor4="0.0000000000" toxlod1="0.0000000000" toxlod2="0.0000000000" toxlod3="0.0000000000" toxlod4="0.0000000000" toxloq1="0.0000000000" toxloq2="0.0000000000" toxloq3="0.0000000000" toxloq4="0.0000000000" userfactor="1.0000000000" userrf="0.0000000000" picsforward="0" picsreverse="0" iFIT="N/A" iFITnorm="N/A" iFITconfidence="N/A" foundmass="N/A" mDamasserror="N/A" ppmmasserror="N/A" iFitflag="0" iFitnormflag="0" iFitconfflag="0" mDaerrorflag="0" ppmerrorflag="0">
                            <ISPEAK area="" height="" foundrt="" absresponse=""/>
                        </PEAK>
                        <METHOD rref="0.0000000000" predrt="1.7500000000" predrrt="0.0000000000" userfactor="0.0000000000" userrf="0.0000000000" quantrace="318_322" secondarytrace="" useabsmasswin="1" chromasswinabs="1.0000000000" chromasswinppm="10.0000000000" stockconcfactor="0.0000000000" calibref="Palythine" replim="0.0000000000" replimflag="0" maxreplim="0.0000000000" maxreplimflag="0" minrecovlim="0.0000000000" maxrecovlim="100.0000000000" recovlimflag="0" maxstddev="0.0000000000" signoiseflag="0" mincoeffdet="0.5000000000" cdflag="0" minpeakwidth="0.0000000000" peakwidthtol="0.0000000000" peakwidthflag="0" blanklevel="0.0000000000" stddevflag="0" rtupper="0.0000000000" rtlower="0.0000000000" rtflag="0"/>
                        <USERDATA sampleid="2" groupid="1"/>
                    </COMPOUND>
                    <COMPOUND id="14" sampleid="2" groupid="1" name="Porphyra 334 SIR" type="" cas="" stdconc="0.0000000000">
                        <PEAK foundscan="162" foundrt="3.3459000587" foundrrt="0.0000000000" predrt="3.3099999428" predrrt="1.0000000000" area="1934833.8750000000" height="14881056.0000000000" response="1934833.8750000000" pkflags="bb" analconc="0.0000000000" empc="0.0000000000" bsanalconc="0.0000000000" conccalc="NaN" modifieddate="" modifiedtime="" modifiedtext="" modifieduser="" peakmass="0.0000000000" startrt="3.1800999641" endrt="3.7107000351" startht="5267.0000000000" endht="16324.8000000000" absresponse="1934833.8750000000" rrtref="0" quanratio="0.0000000000" quanratiopred="1.0000000000" quanratiowin="0.0000000000" ionratio="0.0000000000" ionratiopred="0.0000000000" ionratiowin="0" ionratioflag="0" chromnoise="208.7208557129" detectionthreshold="0.0000000000" detectionflag="0" quanthreshold="0.0000000000" quanflag="0" snlodflag="0" snloqflag="0" rrf="0.0000000000" chromtrace="347.1" peaks="0" pkwidth="7.5160000000" pksigma="3.2120000000" pkskew="0.6470000000" pkkurt="1.3920000000" heightdivarea="7.6911285213" baselinewidth="31.8360042572" peakquality="n/a" peakqualitydesc="" peakqualityref="N" replimflag="0" maxreplimflag="0" recovlimflag="0" matrixblankflag="0" solventblankflag="0" devflag="0" devflagmidconc="0" devflaglowconc="0" qcsignoiseflag="0" qcionratioflag="0" qcrettimeflag="0" qcpeakshapeflag="0" signoise="71296.4497446734" signoiseflag="0" cdflag="0" stddevflag="0" rtflag="0" peakasymmetry="0" peakfrontwidth="0.1669999957" peaktailwidth="0.3639999926" peakasymmetryvalue="2.1860001087" percrecovery="0.0000000000" symflag="" percsym="0.0000000000" belowrl="1" chromnoisehgt="5185.1130000000" concdevperc="0.0000000000" lowerbound1="0.0000000000" lowerbound2="0.0000000000" lowerbound3="0.0000000000" lowerbound4="0.0000000000" mediumbound1="0.0000000000" mediumbound2="0.0000000000" mediumbound3="0.0000000000" mediumbound4="0.0000000000" upperbound1="0.0000000000" upperbound2="0.0000000000" upperbound3="0.0000000000" upperbound4="0.0000000000" 
nosolflag="0" peakmissing="0" peaksinc="0" toxconc1="0.0000000000" toxconc2="0.0000000000" toxconc3="0.0000000000" toxconc4="0.0000000000" toxfactor1="0.0000000000" toxfactor2="0.0000000000" toxfactor3="0.0000000000" toxfactor4="0.0000000000" toxlod1="0.0000000000" toxlod2="0.0000000000" toxlod3="0.0000000000" toxlod4="0.0000000000" toxloq1="0.0000000000" toxloq2="0.0000000000" toxloq3="0.0000000000" toxloq4="0.0000000000" userfactor="1.0000000000" userrf="0.0000000000" picsforward="0" picsreverse="0" iFIT="N/A" iFITnorm="N/A" iFITconfidence="N/A" foundmass="N/A" mDamasserror="N/A" ppmmasserror="N/A" iFitflag="0" iFitnormflag="0" iFitconfflag="0" mDaerrorflag="0" ppmerrorflag="0">
                            <ISPEAK area="" height="" foundrt="" absresponse=""/>
                        </PEAK>
                        <METHOD rref="0.0000000000" predrt="3.3099999428" predrrt="1.0000000000" userfactor="0.0000000000" userrf="0.0000000000" quantrace="347.1" secondarytrace="" useabsmasswin="1" chromasswinabs="1.0000000000" chromasswinppm="10.0000000000" stockconcfactor="0.0000000000" calibref="Porphyra 334 SIR" replim="0.0000000000" replimflag="0" maxreplim="0.0000000000" maxreplimflag="0" minrecovlim="0.0000000000" maxrecovlim="100.0000000000" recovlimflag="0" maxstddev="0.0000000000" signoiseflag="0" mincoeffdet="0.5000000000" cdflag="0" minpeakwidth="0.0000000000" peakwidthtol="0.0000000000" peakwidthflag="0" blanklevel="0.0000000000" stddevflag="0" rtupper="0.0000000000" rtlower="0.0000000000" rtflag="0"/>
                        <USERDATA sampleid="2" groupid="1"/>
                    </COMPOUND>
                    <USERDATA sampleid="2" groupid="1"/>
                </SAMPLE>
            </SAMPLELISTDATA>
            <CALIBRATIONDATA filename="C:\Masslynx  Projects\Caffeine.PRO\CurveDB\Meth1.cdb" modifieddate="25 Sep 2015" modifiedtime="00:20:14" count="2">
                <COMPOUND id="1" name="Compound A ( 430.5 )">
                    <RESPONSE type="External Std" ref="" rah="Area"/>
                    <CURVE type="RF" origin="" weighting="" axistrans="">
                        <RESPONSEFACTOR cc="15552.5556000000" stddev="2208.2674143620" percrelsd="0.1319874310"/>
                    </CURVE>
                </COMPOUND>
                <COMPOUND id="2" name="Compound B ( 458.5 )">
                    <RESPONSE type="Internal Std" ref="1" rah="Area * ( IS Conc. / IS Area )"/>
                    <CURVE type="Linear" origin="Exclude" weighting="1/x" axistrans="None">
                        <CALIBRATIONCURVE curve="0.012594 * x + 0.005516"/>
                        <DETERMINATION rsquared="0.9741537568"/>
                    </CURVE>
                </COMPOUND>
            </CALIBRATIONDATA>
        </GROUP>
    </GROUPDATA>
</QUANDATASET>

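What I would like to end up with is a single data frame (in R or Python/pandas) in which each row holds all the data (attributes) associated with one SAMPLE/COMPOUND pair. In my example above there are 2 samples with 2 compounds each, so the data frame should have 4 rows, with many columns containing all the attributes of every node and child associated with each pair.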
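One possible way to get that row-per-pair shape in Python is sketched below, using only the standard library. It is a sketch under assumptions, not a tested solution for the full file: the XML is a shortened stand-in, `pair_rows` is a hypothetical helper name, and the `sample_` / `compound_` column prefixes are an arbitrary choice made because SAMPLE and COMPOUND share attribute names such as `id`, `name` and `type`.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the real file (attribute lists heavily trimmed).
XML = """
<QUANDATASET>
  <GROUPDATA>
    <GROUP id="1">
      <SAMPLELISTDATA>
        <SAMPLE id="1" name="MAA_211220_01" type="Analyte">
          <COMPOUND id="1" name="Palythine" type=""/>
          <COMPOUND id="14" name="Porphyra 334 SIR" type=""/>
        </SAMPLE>
        <SAMPLE id="2" name="MAA_211220_02" type="Analyte">
          <COMPOUND id="1" name="Palythine" type=""/>
          <COMPOUND id="14" name="Porphyra 334 SIR" type=""/>
        </SAMPLE>
      </SAMPLELISTDATA>
    </GROUP>
  </GROUPDATA>
</QUANDATASET>
"""


def pair_rows(root):
    """Return one dict per SAMPLE/COMPOUND pair, keyed by prefixed attribute name."""
    rows = []
    for sample in root.iterfind(".//SAMPLELISTDATA/SAMPLE"):
        # Sample-level attributes are repeated on every row for that sample.
        sample_cols = {f"sample_{k}": v for k, v in sample.attrib.items()}
        for compound in sample.iterfind("COMPOUND"):
            row = dict(sample_cols)
            row.update({f"compound_{k}": v for k, v in compound.attrib.items()})
            rows.append(row)
    return rows


rows = pair_rows(ET.fromstring(XML.strip()))
for row in rows:
    print(row["sample_name"], "|", row["compound_name"])
```

If pandas is installed, `pandas.DataFrame(rows)` turns the list of dicts into a data frame with one column per key. Every value is still a string at that point, so numeric columns need converting afterwards.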

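A list of data frames, one per sample, would also work, but the sample name would need to be associated with each data frame in that list, so I think one big data frame would probably be easier.

Any help, insights, tips or suggestions would be much appreciated.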

Building on a question I asked previously, here is what I managed to do.

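## Packages used below
library(dplyr)   # %>% and filter()
library(purrr)   # map_dfr()
library(tibble)  # tibble()

## Import the data
# Here, test.xml is the XML you provided

data <- xml2::read_xml("Z:/temp/test.xml")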

## Isolate as list SAMPLELISTDATA
data_list2 <- xml2::as_list(data)[[1]][[3]][[1]][[2]]

## Creating the output data.frame
output_desired <- data.frame(foundscan = NA, area = NA, height = NA) %>% 
  filter(!is.na(foundscan))


## Function to get the attributes
fusion_et_gestion <-  function(y){

  ## We choose the attributes we want to keep
  foundscan <- attr(y,"foundscan")
  area <- attr(y,"area")
  height <- attr(y, "height")
  
  ## Output as tibble
  tibble(foundscan = foundscan,
         area =area,
         height = height)
}


## Using for loops here, but map(1:2, ...) would be faster for your real data
for (j in 1:2) {    # samples
  for (i in 1:2) {  # compounds within each sample
    test <- data_list2[[j]][[i]] %>%
      purrr::map_dfr(fusion_et_gestion)

    ## Plain assignment is enough at top level; the empty filter() did nothing
    output_desired <- rbind(output_desired, test)
  }
}


## Output:

# A tibble: 4 x 3
  foundscan area               height             
  <chr>     <chr>              <chr>              
1 514       89222.9220000000   1567686.0000000000 
2 161       2140861.2500000000 16134221.0000000000
3 517       69654.0080000000   1250121.0000000000 
4 162       1934833.8750000000 14881056.0000000000

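However:

  1. If you want to keep the attributes of the ISPEAK node, you need to add a line to the fusion_et_gestion function and specify the nesting level of its argument, i.e. y[[1]]. Be careful that the name you give this attribute is not reused later in the function.
  2. I did not find a way to include all the attributes without typing them one by one. Since there are 196 of them, one idea would be to add another function inside fusion_et_gestion that gets every attribute name together with its value. This could be done with map(list_of_attribute, function_to_get_values).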
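Both points can be illustrated with a short Python sketch that gathers every attribute of a node and its descendants without naming any of them. This is an alternative to the R approach above rather than a translation of it: the COMPOUND node is a hypothetical, shortened example, `flatten_attributes` is an assumed helper name, and the dotted key format is just one convention for keeping PEAK's `area` apart from ISPEAK's `area`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened COMPOUND node; the real PEAK carries far more attributes.
XML = """
<COMPOUND id="1" name="Palythine">
  <PEAK foundscan="514" area="89222.9220000000" height="1567686.0000000000">
    <ISPEAK area="" height=""/>
  </PEAK>
  <METHOD predrt="1.7500000000" quantrace="318_322"/>
  <USERDATA sampleid="1" groupid="1"/>
</COMPOUND>
"""


def flatten_attributes(element, prefix=""):
    """Collect the attributes of `element` and all of its descendants.

    Keys are built from the chain of tag names joined by dots, followed by
    the attribute name, so same-named attributes on different tags end up
    under different keys.
    """
    path = f"{prefix}.{element.tag}" if prefix else element.tag
    flat = {f"{path}.{name}": value for name, value in element.attrib.items()}
    for child in element:
        flat.update(flatten_attributes(child, path))
    return flat


row = flatten_attributes(ET.fromstring(XML.strip()))
for key, value in row.items():
    print(f"{key} = {value!r}")
```

The helper is meant to be called once per COMPOUND. If an element had two children with the same tag, their keys would collide and the later child would overwrite the earlier one, so it should not be applied directly to a SAMPLE that contains several COMPOUND nodes.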

To get the list of attribute names, you can do the following:

data %>% 
    xml2::xml_find_all("//*") %>% 
    purrr::map(~names(xml2::xml_attrs(.))) %>%
    unlist() %>% 
    unique()
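For readers working in Python instead, a roughly equivalent sketch with the standard-library `xml.etree.ElementTree` is shown below. The XML snippet is a hypothetical, trimmed example, and `attribute_names` is an assumed variable name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed SAMPLE node used only to demonstrate the idea.
XML = """
<SAMPLE id="1" name="MAA_211220_01">
  <COMPOUND id="1" name="Palythine">
    <PEAK foundscan="514" area="89222.9220000000">
      <ISPEAK area="" height=""/>
    </PEAK>
  </COMPOUND>
</SAMPLE>
"""

root = ET.fromstring(XML.strip())

# root.iter() walks every element in document order, including the root.
# dict.fromkeys keeps first-seen order while dropping repeats, which plays
# the role of unlist() followed by unique() in the R pipeline.
attribute_names = list(
    dict.fromkeys(name for element in root.iter() for name in element.attrib)
)
print(attribute_names)
```

Because names are deduplicated across all tags, this list says which attribute names exist but not which tag each one belongs to.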

