
xpath to select from child element to end of parent

I am trying to do this using lxml, but ultimately it is a question about the proper XPath. I'd like to select everything from the <pgBreak> element until the end of its parent, in this case <p>.

XML IN:

  <root>
     <pgBreak pgId="1"/>
      <p>
         some text to fill out a para
           <pgBreak pgId="2"/>
            some more text 
            <quote> A quoted block </quote>
            remainder of para
      </p>
    </root>

XML OUT:

  <root>
     <pgBreak pgId="1"/>
      <p>
         some text to fill out a para
       </p>
          <pgBreak pgId="2"/>
       <p>
             some more text 
            <quote> A quoted block </quote>
            remainder of para
      </p>
    </root>

What you are trying to do is not trivial: not only do you want to match 'pgBreak' elements and all subsequent siblings, you then want to move them outside of the parent scope and wrap the siblings in a 'p' element. Fun stuff.
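
For the selection part on its own, one option is the XPath following-sibling axis, evaluated relative to each nested pgBreak. The sketch below is an assumption about what "until the end of its parent" should cover: node() is used so that text nodes are returned along with elements, and the document is a compacted version of the XML from the question. Note that lxml returns text nodes as string results, not as elements.

```python
import lxml.etree

# Compacted version of the question's input (whitespace stripped for clarity).
doc = lxml.etree.fromstring(
    '<root><pgBreak pgId="1"/><p>some text'
    '<pgBreak pgId="2"/>some more text'
    '<quote>A quoted block</quote>remainder of para</p></root>'
)

selected = []
# Only look at page breaks that sit directly inside a <p>.
for pgbreak in doc.xpath('//p/pgBreak'):
    # Every node after the break, up to the end of its parent.
    for node in pgbreak.xpath('following-sibling::node()'):
        # Text nodes come back as str subclasses; elements have a .tag.
        selected.append(str(node) if isinstance(node, str) else node.tag)

print(selected)
```

Selecting is only half of the job: XPath cannot restructure the tree, so the move and the re-wrapping still have to be done in Python.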

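The following code should give you an idea of how to achieve that (DISCLAIMER: example only, needs clean-up, and edge cases are probably not all handled).

I've modified the input XML slightly to illustrate the functionality better.

import lxml.etree

text = """
<root>
  <pgBreak pgId="1"/>
  <p>
    some text to fill out a para
    <pgBreak pgId="2"/>
    some more text 
    <quote> A quoted block </quote>
    remainder of para
    <pgBreak pgId="3"/>
    <p>
       blurb
    </p>
  </p>
</root>
"""

root = lxml.etree.fromstring(text)
# xpath() returns a list, so the tree can be restructured while looping.
for pgbreak in root.xpath('//pgBreak'):
    inner = pgbreak.getparent()
    if inner is root:
        # Already a direct child of <root>; nothing to move.
        continue
    outer = inner.getparent()
    pgbreak_index = inner.index(pgbreak)
    inner_index = outer.index(inner) + 1
    # Elements that follow the break inside its parent.
    siblings = inner[pgbreak_index+1:]
    # Move the break out so that it directly follows its old parent.
    # lxml carries the tail text along with the element.
    inner.remove(pgbreak)
    outer.insert(inner_index,pgbreak)
    if not siblings and not (pgbreak.tail or '').strip():
        # Nothing follows the break, so there is nothing to wrap.
        continue
    if not siblings or siblings[0].tag != 'p':
        # Wrap the trailing text and elements in a new <p>.
        p = lxml.etree.Element('p')
        p.text = pgbreak.tail
        pgbreak.tail = None
        for node in siblings:
            p.append(node)
        outer.insert(inner_index+1,p)
    else:
        # The trailing content is already a <p>; move it up as-is.
        for node in siblings:
            inner_index += 1
            outer.insert(inner_index,node)

print(lxml.etree.tostring(root, encoding='unicode'))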

Output (whitespace tidied for readability):

<root>
  <pgBreak pgId="1"/>
  <p>
    some text to fill out a para
  </p>
  <pgBreak pgId="2"/>
  <p>
    some more text 
    <quote> A quoted block </quote>
    remainder of para
  </p>
  <pgBreak pgId="3"/>
  <p>
    blurb
  </p>
</root>
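
A quick way to check a result like this is to run a few XPath queries against it. The sketch below parses the output shown above and asserts nothing about how it was produced; the variable names are only illustrative.

```python
import lxml.etree

result = lxml.etree.fromstring("""
<root>
  <pgBreak pgId="1"/>
  <p>
    some text to fill out a para
  </p>
  <pgBreak pgId="2"/>
  <p>
    some more text
    <quote> A quoted block </quote>
    remainder of para
  </p>
  <pgBreak pgId="3"/>
  <p>
    blurb
  </p>
</root>
""")

# No pgBreak should remain inside a <p>, at any depth.
nested_breaks = result.xpath('//p//pgBreak')
# No <p> should remain nested inside another <p>.
nested_paras = result.xpath('//p//p')
# All page breaks should be direct children of <root>, in document order.
page_ids = result.xpath('/root/pgBreak/@pgId')

print(nested_breaks, nested_paras, page_ids)
```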
