How do I use the XPath preceding-sibling axis with XML in Python?

Question

I have an XML file structured like this:

<?xml version="1.0" encoding="utf-8"?>
<pages>
    <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
        <textbox id="0" bbox="179.739,592.028,261.007,604.510">
            <textline bbox="179.739,592.028,261.007,604.510">
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">C</text>
                <text font="NUMPTY+ImprintMTnum-it"  bbox="192.745,592.218,199.339,603.578" ncolour="0" size="12.333">A</text>
                <text font="NUMPTY+ImprintMTnum-it"  bbox="193.745,592.218,199.339,603.578" ncolour="0" size="12.333">P</text>
                <text font="NUMPTY+ImprintMTnum-it"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.333">I</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">T</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">L</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
                <text></text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
                <text></text>
            </textline>
        </textbox>
    </page>
</pages>

The bbox attribute of each text tag holds four values, and I need the difference between the first bbox value of an element and the first bbox value of the element preceding it. In other words, the distance between the first two bboxes is 1. In the following loop, I need to find the preceding sibling of the element whose bbox value I read, so that I can calculate the distance between the two.

    def wrap(line, idxList):
        if len(idxList) == 0:
            return    # No elements to wrap
        # Take the first element from the original location
        idx = idxList.pop(0)     # Index of the first element
        elem = removeByIdx(line, idx) # The indicated element
        # Create "newline" element with "elem" inside
        nElem = E.newline(elem)
        line.insert(idx, nElem)  # Put it in place of "elem"
        while len(idxList) > 0:  # Process the rest of index list
            # Value not used, but must be removed
            idxList.pop(0)
            # Remove the current element from the original location
            currElem = removeByIdx(line, idx + 1)
            nElem.append(currElem)  # Append it to "newline"

    for line in root.iter('textline'):
        idxList = []
        for elem in line:
            bbox = elem.attrib.get('bbox')
            if bbox is not None:
                tbl = bbox.split(',')
                distance = float(tbl[2]) - float(tbl[0])
            else:
                distance = 100  # "Too big" value
            if distance > 10:
                par = elem.getparent()
                idx = par.index(elem)
                idxList.append(idx)
            else:  # "Wrong" element, wrap elements "gathered" so far
                wrap(line, idxList)
                idxList = []
        # Process "good" elements without any "bad" after them, if any
        wrap(line, idxList)

    #print(etree.tostring(root, encoding='unicode', pretty_print=True))

I tried XPath like this:

for x in tree.xpath("//text[@bbox<preceding::text[1]/@bbox+11]"):
    print(x)

But it returns nothing. Is my path wrong, and how can I use it inside the loop?

Answer 1

One problem with your expression is the axis: the axis that selects preceding siblings is preceding-sibling, not preceding (which selects every earlier node in document order except ancestors, sibling or not).

But you don't need an XPath expression here, because lxml has a native method, getprevious, that returns the immediately preceding sibling.

To check access to the previous text element, try the following loop:

for x in tree.xpath('//text'):
    bb = x.attrib.get('bbox')
    if bb is not None:
        bb = bb.split(',')
    print('This: ', bb)
    xPrev = x.getprevious()
    bb = None
    if xPrev is not None:
        bb = xPrev.attrib.get('bbox')
        if bb is not None:
            bb = bb.split(',')
    if bb is not None:
        print('  Previous: ', bb)
    else:
        print('  No previous bbox')

It prints the bbox of the current text element and of its immediately preceding sibling, if any.

Edit

If you want, you can also access the bbox attribute of the preceding text element directly by calling x.xpath('preceding-sibling::text[1]/@bbox').

But remember that this call returns a list of matches, and if nothing is found the list is empty (not None).

So before you make any use of this result, you must:

check the length of the returned list (should be > 0),
retrieve the first element from this list (the text content of the bbox attribute; in this case the list should contain only 1 element),
split it by "," (getting a list of fragments),
check that the first element of this result is not empty,
convert it to float.

After that you can use it, e.g. compare it with the corresponding value from the current bbox.
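
The steps above can be sketched as a small helper. This is only an illustration: the name first_bbox_value is hypothetical, and the lists passed in are hard-coded stand-ins for what x.xpath('preceding-sibling::text[1]/@bbox') and x.xpath('@bbox') would return, so the sketch runs without lxml.

```python
def first_bbox_value(found):
    """Turn the list returned by an XPath attribute query into a float.

    `found` is assumed to look like the result of an lxml call such as
    x.xpath('preceding-sibling::text[1]/@bbox'): a list holding zero or
    one attribute strings. Returns None when no usable value is present.
    """
    if len(found) == 0:           # 1. nothing was found
        return None
    raw = found[0]                # 2. the text of the bbox attribute
    parts = raw.split(",")        # 3. the fragments of the bbox
    if parts[0].strip() == "":    # 4. the first fragment must not be empty
        return None
    return float(parts[0])        # 5. convert it to float


# Hard-coded stand-ins for XPath results (values taken from the question).
prev_left = first_bbox_value(["191.745,592.218,199.339,603.578"])
curr_left = first_bbox_value(["192.745,592.218,199.339,603.578"])
no_prev = first_bbox_value([])    # what an element without a previous text yields

if prev_left is not None and curr_left is not None:
    # Rounded to hide floating-point noise in the subtraction.
    distance = round(curr_left - prev_left, 3)
    print(distance)
```

Returning None for every "no usable value" case keeps the caller's check to a single comparison, in the same style as the loop above.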

Answer 2

Python's XML libraries, lxml included, support only the very old XPath 1.0 standard. In XPath 1.0, the "<" operator always converts its operands to numbers. So when you write

//text[@bbox < preceding::text[1]/@bbox + 11]

you are performing a numeric comparison and a numeric addition on @bbox values.

But @bbox is not a number; it is a comma-separated list of four numbers:

179.739,592.028,261.007,604.510

Converting that to a number produces NaN (not-a-number), and NaN < NaN returns false.
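
To make the NaN behaviour concrete, here is a rough Python emulation of the XPath 1.0 string-to-number conversion. It is a sketch, not lxml's implementation: the function name xpath1_number and the regular expression are my own, written from the XPath 1.0 rule that a string is a number only if it is optional whitespace, an optional minus sign, and a plain decimal number.

```python
import math
import re

# Optional whitespace, optional "-", then digits with an optional fraction
# (or a bare fraction such as ".5"), then optional whitespace.
_XPATH_NUMBER = re.compile(r"^[ \t\r\n]*-?([0-9]+(\.[0-9]*)?|\.[0-9]+)[ \t\r\n]*$")


def xpath1_number(value):
    """Emulate XPath 1.0 number(): a float if the string is numeric, else NaN."""
    if _XPATH_NUMBER.match(value):
        return float(value.strip())
    return math.nan


# Both sides of the comparison from the question become NaN.
left = xpath1_number("192.745,592.218,199.339,603.578")
right = xpath1_number("191.745,592.218,199.339,603.578") + 11

print(math.isnan(left), math.isnan(right))  # both are NaN
print(left < right)                          # any comparison with NaN is false

# A single value converts as expected.
print(xpath1_number(" 192.745 "))
```

Because every comparison involving NaN is false, the predicate never matches and the query returns an empty list.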

To do anything useful with a structured attribute value like this, you really need XPath 2.0 or later.
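
Without XPath 2.0, one workaround is to do the splitting and arithmetic in Python instead of in the XPath expression. The sketch below is an assumption-laden illustration rather than the asker's code: it uses the standard library's xml.etree.ElementTree on a shortened copy of the sample, and the helper names left_edge and left_edge_gaps are hypothetical. It pairs each child of a textline with its previous sibling by position, which plays the role of preceding-sibling::text[1].

```python
import xml.etree.ElementTree as ET

# Shortened copy of one textline from the question.
SAMPLE = """<textline>
  <text bbox="191.745,592.218,199.339,603.578">C</text>
  <text bbox="192.745,592.218,199.339,603.578">A</text>
  <text bbox="193.745,592.218,199.339,603.578">P</text>
  <text></text>
  <text bbox="191.745,592.218,199.339,603.578">I</text>
</textline>"""


def left_edge(elem):
    """Return the first bbox value of elem as a float, or None without a bbox."""
    bbox = elem.get("bbox")
    if not bbox:
        return None
    return float(bbox.split(",")[0])


def left_edge_gaps(line):
    """Return, for each child after the first, its distance to the previous sibling.

    An entry is None when either element of the pair has no bbox.
    """
    children = list(line)
    gaps = []
    for prev, curr in zip(children, children[1:]):
        a, b = left_edge(prev), left_edge(curr)
        # Rounded to hide floating-point noise in the subtraction.
        gaps.append(None if a is None or b is None else round(b - a, 3))
    return gaps


line = ET.fromstring(SAMPLE)
gaps = left_edge_gaps(line)
print(gaps)
```

The same pairing should also work on lxml elements, since they support iteration and get() in the same way; there, getprevious() is an alternative to the zip.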

Answer 1: 1, accepted, 2020-04-15 09:32:45
Answer 2: 1, 2020-04-15 10:51:59