
How to use preceding sibling for XML with xPath in Python?

I have an XML structured like this:

<?xml version="1.0" encoding="utf-8"?>
<pages>
    <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
        <textbox id="0" bbox="179.739,592.028,261.007,604.510">
            <textline bbox="179.739,592.028,261.007,604.510">
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">C</text>
                <text font="NUMPTY+ImprintMTnum-it"  bbox="192.745,592.218,199.339,603.578" ncolour="0" size="12.333">A</text>
                <text font="NUMPTY+ImprintMTnum-it"  bbox="193.745,592.218,199.339,603.578" ncolour="0" size="12.333">P</text>
                <text font="NUMPTY+ImprintMTnum-it"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.333">I</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">T</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">L</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
                <text></text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
                <text></text>
            </textline>
        </textbox>
    </page>
</pages>

The bbox attribute of each text element holds four values. I need the difference between the first bbox value of an element and the first bbox value of its preceding sibling. For example, the distance between the first two text elements is 1 (192.745 - 191.745). In the loop below, I need to find the preceding sibling of the element whose bbox I am reading, so that I can calculate the distance between the two.

def wrap(line, idxList):
    if len(idxList) == 0:
        return    # No elements to wrap
    # Take the first element from the original location
    idx = idxList.pop(0)     # Index of the first element
    elem = removeByIdx(line, idx) # The indicated element
    # Create "newline" element with "elem" inside
    nElem = E.newline(elem)
    line.insert(idx, nElem)  # Put it in place of "elem"
    while len(idxList) > 0:  # Process the rest of index list
        # Value not used, but must be removed
        idxList.pop(0)
        # Remove the current element from the original location
        currElem = removeByIdx(line, idx + 1)
        nElem.append(currElem)  # Append it to "newline"

for line in root.iter('textline'):
    idxList = []
    for elem in line:
        bbox = elem.attrib.get('bbox')
        if bbox is not None:
            tbl = bbox.split(',')
            distance = float(tbl[2]) - float(tbl[0])
        else:
            distance = 100  # "Too big" value
        if distance > 10:
            par = elem.getparent()
            idx = par.index(elem)
            idxList.append(idx)
        else:  # "Wrong" element, wrap elements "gathered" so far
            wrap(line, idxList)
            idxList = []
    # Process "good" elements without any "bad" after them, if any
    wrap(line, idxList)

#print(etree.tostring(root, encoding='unicode', pretty_print=True))

I tried with xPath like this:

for x in tree.xpath("//text[@bbox<preceding::text[1]/@bbox+11]"):
    print(x)

But it returns nothing. Is my path wrong and how can I insert it in the loop?

One problem with your expression is the axis name: the axis for preceding siblings is preceding-sibling, not preceding (the preceding axis covers all earlier nodes in the document, not only siblings).

But you don't need an XPath expression here, as lxml has a native method, getprevious(), that returns the immediately preceding sibling.

To check that you can reach the previous text element, try the following loop:

for x in tree.xpath('//text'):
    bb = x.attrib.get('bbox')
    if bb is not None:
        bb = bb.split(',')
    print('This: ', bb)
    xPrev = x.getprevious()
    bb = None
    if xPrev is not None:
        bb = xPrev.attrib.get('bbox')
        if bb is not None:
            bb = bb.split(',')
    if bb is not None:
        print('  Previous: ', bb)
    else:
        print('  No previous bbox')

It prints the bbox of the current text element and of its immediately preceding sibling, if any.
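Building on that loop, here is a minimal sketch of how the distance asked about in the question could be computed with getprevious(). The helper names first_bbox_value and distances are made up for illustration, and the sample XML is a trimmed copy of the one in the question.

```python
from lxml import etree

# Trimmed-down sample mirroring the structure in the question.
XML = b"""<textline>
  <text bbox="191.745,592.218,199.339,603.578">C</text>
  <text bbox="192.745,592.218,199.339,603.578">A</text>
  <text bbox="193.745,592.218,199.339,603.578">P</text>
  <text></text>
  <text bbox="191.745,592.218,199.339,603.578">I</text>
</textline>"""


def first_bbox_value(elem):
    """Hypothetical helper: first bbox number as a float, or None."""
    if elem is None:
        return None
    bbox = elem.get('bbox')
    if not bbox:
        return None
    return float(bbox.split(',')[0])


def distances(line):
    """Hypothetical helper: yield (text, distance) for each text element.

    The distance is None when either the element or its immediately
    preceding sibling has no bbox.
    """
    for elem in line.iter('text'):
        cur = first_bbox_value(elem)
        prev = first_bbox_value(elem.getprevious())
        if cur is None or prev is None:
            yield elem.text, None
        else:
            yield elem.text, cur - prev


root = etree.fromstring(XML)
result = list(distances(root))
for text, dist in result:
    print(text, dist)
```

The value returned for each element can then replace the distance computed in the loop from the question, if that is the comparison you actually want.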

Edit

If you want, you can also read the bbox attribute of the preceding text element directly, by calling x.xpath('preceding-sibling::text[1]/@bbox').

But remember that this call returns a list of found nodes, and if nothing has been found, this list is empty (not None).

So before you make any use of this result, you must:

  • check the length of the returned list (should be > 0),
  • retrieve the first element from this list (the text content of bbox attribute, in this case this list should contain only 1 element),
  • split it by , (getting a list of fragments),
  • check whether the first element of this result is not empty,
  • convert it to float.

After that you can use it, e.g. compare it with the corresponding value from the current bbox.
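The steps above can be sketched roughly as follows. The function name prev_first_bbox and the shortened sample XML are assumptions made for this example only.

```python
from lxml import etree


def prev_first_bbox(elem):
    """Hypothetical helper: first bbox value of the preceding text sibling.

    Returns None when there is no preceding sibling with a bbox.
    """
    found = elem.xpath('preceding-sibling::text[1]/@bbox')
    if len(found) == 0:          # empty list, not None
        return None
    first = found[0].split(',')[0]
    if first == '':              # guard against an empty fragment
        return None
    return float(first)


line = etree.fromstring(
    '<textline>'
    '<text bbox="191.745,592.218,199.339,603.578">C</text>'
    '<text bbox="192.745,592.218,199.339,603.578">A</text>'
    '<text></text>'
    '<text bbox="191.745,592.218,199.339,603.578">I</text>'
    '</textline>'
)
values = [prev_first_bbox(t) for t in line]
print(values)
```

Note that the last element gets None here, because its immediately preceding sibling is the empty text element without a bbox.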

Python's lxml supports only the very old XPath 1.0 standard. In XPath 1.0, the "<" operator always converts its operands to numbers. So when you write

//text[@bbox < preceding::text[1]/@bbox + 11]

you are performing a numeric comparison and a numeric addition on @bbox values.

But @bbox is not a number, it is a comma-separated list of four numbers:

179.739,592.028,261.007,604.510 

Converting that to a number produces NaN (not-a-number), and NaN < NaN returns false.
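A quick way to see this for yourself, as a small sketch using lxml (the single-element document is just a stand-in for one text element):

```python
import math
from lxml import etree

# One element carrying a bbox like the ones in the question.
elem = etree.fromstring('<text bbox="179.739,592.028,261.007,604.510"/>')

# number() applies the same string-to-number conversion that "<" and "+" use.
as_number = elem.xpath('number(@bbox)')
print(as_number, math.isnan(as_number))

# Both sides become NaN, so the comparison is false.
comparison = elem.xpath('@bbox < @bbox + 11')
print(comparison)
```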

To do anything useful with a structured attribute value like this, you really need XPath 2.0 or later.
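That said, for the narrow case in the question, where only the first comma-separated field is needed, the XPath 1.0 string function substring-before() may be enough. The following is a sketch under that assumption, not a general solution; the sample bbox values are invented so that some elements pass the test and some do not.

```python
from lxml import etree

# Invented sample values; only the first field of each bbox matters here.
line = etree.fromstring(
    '<textline>'
    '<text bbox="191.745,0,0,0">C</text>'
    '<text bbox="192.745,0,0,0">A</text>'
    '<text bbox="250.000,0,0,0">P</text>'
    '<text></text>'
    '<text bbox="251.000,0,0,0">I</text>'
    '</textline>'
)

# Extract the text before the first comma and convert it explicitly.
# A missing bbox yields an empty string, hence NaN, hence "false".
expr = (
    '//text[number(substring-before(@bbox, ",")) < '
    'number(substring-before(preceding-sibling::text[1]/@bbox, ",")) + 11]'
)
close = [t.text for t in line.xpath(expr)]
print(close)
```

Anything that needs the other three fields, or arithmetic over all of them, quickly becomes unwieldy in XPath 1.0, so doing the parsing in Python is usually the simpler route.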
