XPath on lxml's iterparse matches elements outside its scope

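I have a large corpus that I am parsing with lxml, so I use iterparse, which makes it easy to read the XML on the fly. With iterparse(fh, tag="your_tag") we can iterate efficiently over the nodes of a large file.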

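I want to do some XPath matching for each main tag in the file, in my case alpino_ds. For each alpino_ds node, I want to check whether a given XPath matches. However, I find that the XPath reports a match for an element when it is actually matching something else in the document: not the alpino_ds element of the current iteration, but a subsequent one.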

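I am confused about why this happens. In the example below, I expect only one match (in the last alpino_ds node), but as you can see it matches three times, and the matched XPath result is the same item in all three cases (part of the last node)!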

from io import BytesIO
import lxml.etree as ET

xml = """<treebank>
<alpino_ds version="1.3" id="WR-P-P-D-0000000006.p.34.s.1">
    <node begin="0" cat="top" end="4" id="0" rel="top">
      <node begin="0" cat="du" end="3" id="1" rel="--">
        <node begin="0" conjtype="neven" end="1" frame="complementizer(root)" id="2" lcat="du" lemma="en" pos="comp" postag="VG(neven)" pt="vg" rel="dlink" root="en" sc="root" sense="en" word="en"/>
        <node begin="1" cat="np" end="3" id="3" rel="nucl">
          <node begin="1" end="2" frame="number(hoofd(sg_num))" id="4" infl="sg_num" lcat="detp" lemma="een" numtype="hoofd" pos="num" positie="vrij" postag="TW(hoofd,vrij)" pt="tw" rel="det" root="één" sense="één" special="hoofd" word="één"/>
          <node begin="2" end="3" frame="noun(de,count,sg)" gen="de" genus="zijd" getal="ev" graad="basis" id="5" lcat="np" lemma="printer" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,basis,zijd,stan)" pt="n" rel="hd" root="printer" sense="printer" word="printer"/>
        </node>
      </node>
      <node begin="3" end="4" frame="punct(punt)" id="6" lcat="punct" lemma="." pos="punct" postag="LET()" pt="let" rel="--" root="." sense="." special="punt" word="."/>
    </node>
    <sentence>en één printer .</sentence>
    <comments>
      <comment>Q#WR-P-P-D-0000000006.p.34.s.1|en één printer .|1|1|1.2960516563900006</comment>
    </comments>
  </alpino_ds>
  <alpino_ds version="1.3" id="WR-P-P-D-0000000006.p.34.s.2">
    <node begin="0" cat="top" end="20" id="0" rel="top">
      <node begin="0" cat="smain" end="19" id="1" rel="--">
        <node begin="0" cat="np" end="2" id="2" index="1" rel="su">
          <node begin="0" end="1" frame="determiner(de,nwh,nmod,pro,nparg)" getal="getal" id="3" infl="de" lcat="detp" lemma="die" naamval="stan" pdtype="pron" persoon="3" pos="det" postag="VNW(aanw,pron,stan,vol,3,getal)" pt="vnw" rel="det" root="die" sense="die" status="vol" vwtype="aanw" wh="nwh" word="Die"/>
          <node begin="1" end="2" frame="noun(de,count,sg)" gen="de" genus="zijd" getal="ev" graad="basis" id="4" lcat="np" lemma="printer" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,basis,zijd,stan)" pt="n" rel="hd" root="printer" sense="printer" word="printer"/>
        </node>
        <node begin="2" end="3" frame="verb(unacc,sg3,passive)" id="5" infl="sg3" lcat="smain" lemma="worden" pos="verb" postag="WW(pv,tgw,met-t)" pt="ww" pvagr="met-t" pvtijd="tgw" rel="hd" root="word" sc="passive" sense="word" tense="present" word="wordt" wvorm="pv"/>
        <node begin="0" cat="ppart" end="19" id="6" rel="vc">
          <node begin="0" end="2" id="7" index="1" rel="obj1"/>
          <node begin="3" buiging="zonder" end="4" frame="verb(hebben,psp,np_pc_pp(voor))" id="8" infl="psp" lcat="ppart" lemma="gebruiken" pos="verb" positie="vrij" postag="WW(vd,vrij,zonder)" pt="ww" rel="hd" root="gebruik" sc="np_pc_pp(voor)" sense="gebruik-voor" word="gebruikt" wvorm="vd"/>
          <node begin="4" cat="pp" end="19" id="9" rel="pc">
            <node begin="4" end="5" frame="preposition(voor,[aan,door,uit,[in,de,plaats]])" id="10" lcat="pp" lemma="voor" pos="prep" postag="VZ(init)" pt="vz" rel="hd" root="voor" sense="voor" vztype="init" word="voor"/>
            <node begin="5" cat="np" end="19" id="11" rel="obj1">
              <node begin="5" end="6" frame="determiner(het,nwh,nmod,pro,nparg,wkpro)" id="12" infl="het" lcat="detp" lemma="het" lwtype="bep" naamval="stan" npagr="evon" pos="det" postag="LID(bep,stan,evon)" pt="lid" rel="det" root="het" sense="het" wh="nwh" word="het"/>
              <node begin="6" end="7" frame="v_noun(intransitive)" getal="mv" graad="basis" id="13" lcat="np" lemma="druk" ntype="soort" pos="verb" postag="N(soort,mv,basis)" pt="n" rel="hd" root="druk" sc="intransitive" sense="druk" special="v_noun" word="drukken"/>
              <node begin="7" cat="pp" end="19" id="14" rel="mod">
                <node begin="7" end="8" frame="preposition(van,[af,uit,vandaan,[af,aan]])" id="15" lcat="pp" lemma="van" pos="prep" postag="VZ(init)" pt="vz" rel="hd" root="van" sense="van" vztype="init" word="van"/>
                <node begin="8" cat="np" end="19" id="16" rel="obj1">
                  <node begin="8" end="9" frame="determiner(de)" id="17" infl="de" lcat="detp" lemma="de" lwtype="bep" naamval="stan" npagr="rest" pos="det" postag="LID(bep,stan,rest)" pt="lid" rel="det" root="de" sense="de" word="de"/>
                  <node begin="9" end="10" frame="noun(de,count,sg)" gen="de" genus="zijd" getal="ev" graad="basis" id="18" lcat="np" lemma="tekst" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,basis,zijd,stan)" pt="n" rel="hd" root="tekst" sense="tekst" word="tekst"/>
                  <node begin="10" cat="pp" end="19" id="19" rel="mod">
                    <node begin="10" end="11" frame="preposition(van,[af,uit,vandaan,[af,aan]])" id="20" lcat="pp" lemma="van" pos="prep" postag="VZ(init)" pt="vz" rel="hd" root="van" sense="van" vztype="init" word="van"/>
                    <node begin="11" cat="conj" end="19" id="21" rel="obj1">
                      <node begin="14" conjtype="neven" end="15" frame="conj(en)" id="22" lcat="vg" lemma="en" pos="vg" postag="VG(neven)" pt="vg" rel="crd" root="en" sense="en" word="en"/>
                      <node begin="11" cat="np" end="19" id="23" rel="cnj">
                        <node begin="11" end="12" frame="modal_adverb" id="24" index="2" lcat="advp" lemma="bijvoorbeeld" pos="adv" postag="BW()" pt="bw" rel="mod" root="bijvoorbeeld" sc="modal" sense="bijvoorbeeld" word="bijvoorbeeld"/>
                        <node begin="12" end="13" frame="determiner(de)" id="25" index="3" infl="de" lcat="detp" lemma="de" lwtype="bep" naamval="stan" npagr="rest" pos="det" postag="LID(bep,stan,rest)" pt="lid" rel="det" root="de" sense="de" word="de"/>
                        <node begin="13" end="14" frame="noun(de,count,sg)" gen="de" genus="zijd" getal="ev" graad="basis" id="26" lcat="np" lemma="naam" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,basis,zijd,stan)" pt="n" rel="hd" root="naam" sense="naam" word="naam"/>
                        <node begin="16" cat="pp" end="19" id="27" index="4" rel="mod">
                          <node begin="16" end="17" frame="preposition(op,[af,na])" id="28" lcat="pp" lemma="op" pos="prep" postag="VZ(init)" pt="vz" rel="hd" root="op" sense="op" vztype="init" word="op"/>
                          <node begin="17" cat="np" end="19" id="29" rel="obj1">
                            <node begin="17" end="18" frame="determiner(de)" id="30" infl="de" lcat="detp" lemma="de" lwtype="bep" naamval="stan" npagr="rest" pos="det" postag="LID(bep,stan,rest)" pt="lid" rel="det" root="de" sense="de" word="de"/>
                            <node begin="18" end="19" frame="noun(de,count,sg)" gen="de" genus="zijd" getal="ev" graad="basis" id="31" lcat="np" lemma="cd" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,basis,zijd,stan)" pt="n" rel="hd" root="cd" sense="cd" word="cd"/>
                          </node>
                        </node>
                      </node>
                      <node begin="11" cat="np" end="19" id="32" rel="cnj">
                        <node begin="11" end="12" id="33" index="2" rel="mod"/>
                        <node begin="12" end="13" id="34" index="3" rel="det"/>
                        <node begin="15" end="16" frame="noun(het,count,pl)" gen="het" getal="mv" graad="basis" id="35" lcat="np" lemma="adresgegevens" ntype="soort" num="pl" pos="noun" postag="N(soort,mv,basis)" pt="n" rel="hd" root="adres_gegeven" sense="adres_gegeven" word="adresgegevens"/>
                        <node begin="16" end="19" id="36" index="4" rel="mod"/>
                      </node>
                    </node>
                  </node>
                </node>
              </node>
            </node>
          </node>
        </node>
      </node>
      <node begin="19" end="20" frame="punct(punt)" id="37" lcat="punct" lemma="." pos="punct" postag="LET()" pt="let" rel="--" root="." sense="." special="punt" word="."/>
    </node>
    <sentence>Die printer wordt gebruikt voor het drukken van de tekst van bijvoorbeeld de naam en adresgegevens op de cd .</sentence>
    <comments>
      <comment>Q#WR-P-P-D-0000000006.p.34.s.2|Die printer wordt gebruikt voor het drukken van de tekst van bijvoorbeeld de naam en adresgegevens op de cd .|1|1|0.11022457209000547</comment>
    </comments>
  </alpino_ds>
  <alpino_ds version="1.3" id="WR-P-P-D-0000000006.p.34.s.3">
    <node begin="0" cat="top" end="25" id="0" rel="top">
      <node begin="15" end="16" frame="punct(komma)" id="1" lcat="punct" lemma="," pos="punct" postag="LET()" pt="let" rel="--" root="," sense="," special="komma" word=","/>
      <node begin="22" end="23" frame="punct(komma)" id="2" lcat="punct" lemma="," pos="punct" postag="LET()" pt="let" rel="--" root="," sense="," special="komma" word=","/>
      <node begin="0" cat="smain" end="25" id="3" rel="--">
        <node begin="0" cat="np" end="2" id="4" rel="su">
          <node begin="0" end="1" frame="determiner(een)" id="5" infl="een" lcat="detp" lemma="een" lwtype="onbep" naamval="stan" npagr="agr" pos="det" postag="LID(onbep,stan,agr)" pt="lid" rel="det" root="een" sense="een" word="Een"/>
          <node begin="1" end="2" frame="noun(het,count,sg)" gen="het" genus="onz" getal="ev" graad="dim" id="6" lcat="np" lemma="robot-arm" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,dim,onz,stan)" pt="n" rel="hd" root="robot_arm_DIM" sense="robot_arm_DIM" word="robot-armpje"/>
        </node>
        <node begin="2" end="3" frame="verb(hebben,sg3,er_pp_sbar(voor))" id="7" infl="sg3" lcat="smain" lemma="zorgen" pos="verb" postag="WW(pv,tgw,met-t)" pt="ww" pvagr="met-t" pvtijd="tgw" rel="hd" root="zorg" sc="er_pp_sbar(voor)" sense="zorg-voor" tense="present" word="zorgt" wvorm="pv"/>
        <node begin="3" cat="pp" end="25" id="8" rel="pc">
          <node begin="3" end="4" frame="er_adverb(voor)" id="9" lcat="pp" lemma="ervoor" pos="pp" postag="BW()" pt="bw" rel="hd" root="ervoor" sense="ervoor" special="er" word="ervoor"/>
          <node begin="4" cat="cp" end="25" id="10" rel="vc">
            <node begin="4" conjtype="onder" end="5" frame="complementizer(dat)" id="11" lcat="cp" lemma="dat" pos="comp" postag="VG(onder)" pt="vg" rel="cmp" root="dat" sc="dat" sense="dat" word="dat"/>
            <node begin="5" cat="conj" end="25" id="12" rel="body">
              <node begin="5" cat="ssub" end="13" id="13" rel="cnj">
                <node begin="5" cat="np" end="7" id="14" index="1" rel="su">
                  <node begin="5" end="6" frame="determiner(de)" id="15" infl="de" lcat="detp" lemma="de" lwtype="bep" naamval="stan" npagr="rest" pos="det" postag="LID(bep,stan,rest)" pt="lid" rel="det" root="de" sense="de" word="de"/>
                  <node begin="6" end="7" frame="noun(de,count,pl)" gen="de" getal="mv" graad="basis" id="16" lcat="np" lemma="brander" ntype="soort" num="pl" pos="noun" postag="N(soort,mv,basis)" pt="n" rel="hd" root="brander" sense="brander" word="branders"/>
                </node>
                <node begin="9" end="10" frame="verb(unacc,pl,passive)" id="17" infl="pl" lcat="ssub" lemma="worden" pos="verb" postag="WW(pv,tgw,mv)" pt="ww" pvagr="mv" pvtijd="tgw" rel="hd" root="word" sc="passive" sense="word" tense="present" word="worden" wvorm="pv"/>
                <node begin="5" cat="ppart" end="13" id="18" rel="vc">
                  <node begin="5" end="7" id="19" index="1" rel="obj1"/>
                  <node begin="7" end="8" frame="adverb" id="20" lcat="advp" lemma="steeds" pos="adv" postag="BW()" pt="bw" rel="mod" root="steeds" sense="steeds" word="steeds"/>
                  <node begin="8" buiging="zonder" end="9" frame="verb(hebben,psp,np_pc_pp(met))" id="21" infl="psp" lcat="ppart" lemma="laden" pos="verb" positie="vrij" postag="WW(vd,vrij,zonder)" pt="ww" rel="hd" root="laad" sc="np_pc_pp(met)" sense="laad-met" word="geladen" wvorm="vd"/>
                  <node begin="10" cat="pp" end="13" id="22" rel="pc">
                    <node begin="10" end="11" frame="preposition(met,[mee,[en,al]])" id="23" lcat="pp" lemma="met" pos="prep" postag="VZ(init)" pt="vz" rel="hd" root="met" sense="met" vztype="init" word="met"/>
                    <node begin="11" cat="np" end="13" id="24" rel="obj1">
                      <node aform="base" begin="11" buiging="met-e" end="12" frame="adjective(e)" graad="basis" id="25" infl="e" lcat="ap" lemma="leeg" naamval="stan" pos="adj" positie="prenom" postag="ADJ(prenom,basis,met-e,stan)" pt="adj" rel="mod" root="leeg" sense="leeg" vform="adj" word="lege"/>
                      <node begin="12" end="13" frame="noun(de,count,pl)" gen="de" getal="mv" graad="basis" id="26" lcat="np" lemma="cd" ntype="soort" num="pl" pos="noun" postag="N(soort,mv,basis)" pt="n" rel="hd" root="cd" sense="cd" word="cd&apos;s"/>
                    </node>
                  </node>
                </node>
              </node>
              <node begin="13" conjtype="neven" end="14" frame="conj(en)" id="27" lcat="vg" lemma="en" pos="vg" postag="VG(neven)" pt="vg" rel="crd" root="en" sense="en" word="en"/>
              <node begin="14" cat="ssub" end="25" id="28" rel="cnj">
                <node begin="14" end="15" frame="determiner(het,nwh,nmod,pro,nparg)" getal="ev" id="29" infl="het" lcat="np" lemma="dat" naamval="stan" pdtype="pron" persoon="3o" pos="det" postag="VNW(aanw,pron,stan,vol,3o,ev)" pt="vnw" rel="su" root="dat" sense="dat" status="vol" vwtype="aanw" wh="nwh" word="dat"/>
                <node begin="16" cat="cp" end="22" id="30" rel="mod">
                  <node begin="16" conjtype="onder" end="17" frame="complementizer(als)" id="31" lcat="cp" lemma="als" pos="comp" postag="VG(onder)" pt="vg" rel="cmp" root="als" sc="als" sense="als" word="als"/>
                  <node begin="17" cat="ssub" end="22" id="32" rel="body">
                    <node begin="17" case="both" def="def" end="18" frame="pronoun(nwh,thi,both,de,both,def,wkpro)" gen="de" getal="mv" id="33" index="2" lcat="np" lemma="ze" naamval="stan" num="both" pdtype="pron" per="thi" persoon="3" pos="pron" postag="VNW(pers,pron,stan,red,3,mv)" pt="vnw" rel="su" root="ze" sense="ze" special="wkpro" status="red" vwtype="pers" wh="nwh" word="ze"/>
                    <node begin="19" end="20" frame="verb(unacc,pl,passive)" id="34" infl="pl" lcat="ssub" lemma="zijn" pos="verb" postag="WW(pv,tgw,mv)" pt="ww" pvagr="mv" pvtijd="tgw" rel="hd" root="ben" sc="passive" sense="ben" tense="present" word="zijn" wvorm="pv"/>
                    <node begin="17" cat="ppart" end="22" id="35" rel="vc">
                      <node begin="17" end="18" id="36" index="2" rel="obj1"/>
                      <node begin="18" end="19" frame="verb(hebben,psp,np_pc_pp(van))" id="37" infl="psp" lcat="ppart" lemma="voorzien" pos="verb" postag="WW(pv,tgw,mv)" pt="ww" pvagr="mv" pvtijd="tgw" rel="hd" root="voorzie" sc="np_pc_pp(van)" sense="voorzie-van" word="voorzien" wvorm="pv"/>
                      <node begin="20" cat="pp" end="22" id="38" rel="pc">
                        <node begin="20" end="21" frame="preposition(van,[af,uit,vandaan,[af,aan]])" id="39" lcat="pp" lemma="van" pos="prep" postag="VZ(init)" pt="vz" rel="hd" root="van" sense="van" vztype="init" word="van"/>
                        <node begin="21" end="22" frame="noun(de,mass,sg)" gen="de" genus="zijd" getal="ev" graad="basis" id="40" lcat="np" lemma="audio" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,basis,zijd,stan)" pt="n" rel="obj1" root="audio" sense="audio" word="audio"/>
                      </node>
                    </node>
                  </node>
                </node>
                <node begin="23" case="both" def="def" end="24" frame="pronoun(nwh,thi,both,de,both,def,wkpro)" gen="de" getal="mv" id="41" lcat="np" lemma="ze" naamval="stan" num="both" pdtype="pron" per="thi" persoon="3" pos="pron" postag="VNW(pers,pron,stan,red,3,mv)" pt="vnw" rel="obj1" root="ze" sense="ze" special="wkpro" status="red" vwtype="pers" wh="nwh" word="ze"/>
                <node begin="24" buiging="zonder" end="25" frame="verb(hebben,sg3,transitive)" id="42" infl="sg3" lcat="ssub" lemma="verplaatsen" pos="verb" positie="vrij" postag="WW(vd,vrij,zonder)" pt="ww" rel="hd" root="verplaats" sc="transitive" sense="verplaats" tense="present" word="verplaatst" wvorm="vd"/>
              </node>
            </node>
          </node>
        </node>
      </node>
    </node>
    <sentence>Een robot-armpje zorgt ervoor dat de branders steeds geladen worden met lege cd&apos;s en dat , als ze voorzien zijn van audio , ze verplaatst</sentence>
    <comments>
      <comment>Q#WR-P-P-D-0000000006.p.34.s.3|Een robot-armpje zorgt ervoor dat de branders steeds geladen worden met lege cd&apos;s en dat , als ze voorzien zijn van audio , ze verplaatst|1|1|-0.4347218970399951</comment>
    </comments>
  </alpino_ds>
  </treebank>
"""

xpath = '//node[@cat="cp" and node[@rel="cmp" and @pt="vg" and number(@begin) < number(../node[@rel="body" and @cat="ssub"]/node[@rel="vc" and @cat="ppart"]/node[@rel="hd" and @pt="ww"]/@begin)] and node[@rel="body" and @cat="ssub" and node[@rel="vc" and @cat="ppart" and node[@rel="hd" and @pt="ww" and number(@begin) < number(../../node[@rel="hd" and @pt="ww"]/@begin)]] and node[@rel="hd" and @pt="ww"]]]'


for _, element in ET.iterparse(BytesIO(str.encode(xml)), tag="alpino_ds", events=("end", )):
    result = element.xpath(xpath)
    if result:
        print("match", ET.tostring(result[0]))

What am I missing here?

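In XPath, an absolute path starting with / searches downward from the document node (sometimes also called the root node). So if you start with e.g. //node, you select node elements anywhere in the document that contains the context node, that is, the element you called xpath() on.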
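A minimal sketch of that behaviour on a fully parsed tree, using a made-up three-node document (the ids "a" and "b" are placeholders, not taken from the corpus above). It only illustrates how lxml's xpath() treats a leading / or // when called on an element:

```python
import lxml.etree as ET

# Tiny illustrative document; the structure mimics the treebank above.
root = ET.fromstring(
    "<treebank>"
    "<alpino_ds id='a'><node/></alpino_ds>"
    "<alpino_ds id='b'><node/><node/></alpino_ds>"
    "</treebank>"
)
first = root[0]  # the alpino_ds element with id="a"

# Absolute path: evaluated from the document node, so the context element
# does not restrict the search.
absolute_count = len(first.xpath("//node"))

# Relative path: only descendants of the context element are considered.
relative_count = len(first.xpath(".//node"))

# "/*" selects the document's root element, whatever the context element is.
root_tag = first.xpath("/*")[0].tag

print(absolute_count, relative_count, root_tag)  # 3 1 treebank
```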

So, to select relative to (inside) the alpino_ds element you are currently on, use a path starting with .//node and not //node.
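Applied to the iterparse loop from the question, the fix could look like the sketch below. The XML and the short XPath are simplified stand-ins for the real data, and matching_ids is a hypothetical helper introduced only for this illustration:

```python
from io import BytesIO
import lxml.etree as ET

# Simplified stand-in for the treebank; only the last sentence has a "cp" node.
xml = b"""<treebank>
  <alpino_ds id="s1"><node cat="np"/></alpino_ds>
  <alpino_ds id="s2"><node cat="np"/></alpino_ds>
  <alpino_ds id="s3"><node cat="cp"/></alpino_ds>
</treebank>"""


def matching_ids(data, xpath):
    """Return the ids of the alpino_ds elements on which xpath yields a result."""
    ids = []
    for _, element in ET.iterparse(BytesIO(data), tag="alpino_ds", events=("end",)):
        if element.xpath(xpath):
            ids.append(element.get("id"))
    return ids


# Relative path: each alpino_ds is checked against its own descendants only.
relative_ids = matching_ids(xml, './/node[@cat="cp"]')
print(relative_ids)  # ['s3']

# Absolute path: searches whatever part of the document has been parsed so far,
# so elements other than s3 can be reported as matching too.
absolute_ids = matching_ids(xml, '//node[@cat="cp"]')
print(absolute_ids)
```

Which ids the absolute path reports depends on how much of the input the parser has already read when each event is delivered. For a small in-memory document like this one it is plausible that the whole tree is already built, which would explain the three identical matches in the question.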

