How do I get the preceding sibling with XML and XPath in Python?

Question

I have an XML structure like this:

<?xml version="1.0" encoding="utf-8"?>
<pages>
    <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
        <textbox id="0" bbox="179.739,592.028,261.007,604.510">
            <textline bbox="179.739,592.028,261.007,604.510">
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">C</text>
                <text font="NUMPTY+ImprintMTnum-it"  bbox="192.745,592.218,199.339,603.578" ncolour="0" size="12.333">A</text>
                <text font="NUMPTY+ImprintMTnum-it"  bbox="193.745,592.218,199.339,603.578" ncolour="0" size="12.333">P</text>
                <text font="NUMPTY+ImprintMTnum-it"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.333">I</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">T</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">L</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
                <text></text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
                <text font="NUMPTY+ImprintMTnum"  bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
                <text></text>
            </textline>
        </textbox>
    </page>
</pages>

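The bbox attribute of each text tag holds four values, and I need the difference between the first bbox value of an element and the first bbox value of the element before it. In other words, the distance between the first two bboxes is 1 (192.745 - 191.745). In the loop below, I need to find the preceding sibling of the element whose bbox attribute I am reading, so that I can calculate the distance between the two.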

    def wrap(line, idxList):
        if len(idxList) == 0:
            return    # No elements to wrap
        # Take the first element from the original location
        idx = idxList.pop(0)     # Index of the first element
        elem = removeByIdx(line, idx) # The indicated element
        # Create "newline" element with "elem" inside
        nElem = E.newline(elem)
        line.insert(idx, nElem)  # Put it in place of "elem"
        while len(idxList) > 0:  # Process the rest of index list
            # Value not used, but must be removed
            idxList.pop(0)
            # Remove the current element from the original location
            currElem = removeByIdx(line, idx + 1)
            nElem.append(currElem)  # Append it to "newline"

    for line in root.iter('textline'):
        idxList = []
        for elem in line:
            bbox = elem.attrib.get('bbox')
            if bbox is not None:
                tbl = bbox.split(',')
                distance = float(tbl[2]) - float(tbl[0])
            else:
                distance = 100  # "Too big" value
            if distance > 10:
                par = elem.getparent()
                idx = par.index(elem)
                idxList.append(idx)
            else:  # "Wrong" element, wrap elements "gathered" so far
                wrap(line, idxList)
                idxList = []
        # Process "good" elements without any "bad" after them, if any
        wrap(line, idxList)

    #print(etree.tostring(root, encoding='unicode', pretty_print=True))

I tried using XPath like this:

for x in tree.xpath("//text[@bbox<preceding::text[1]/@bbox+11]"):
    print(x)

But it returns nothing. Is my path wrong, and how can I plug it into the loop?

Answer 1

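Your code fails because the axis name for preceding siblings is preceding-sibling (not preceding).

But you do not need an XPath expression here, because lxml has a native method for getting the (first) preceding sibling, called getprevious.

To check access to the previous text node, try the following loop: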

for x in tree.xpath('//text'):
    bb = x.attrib.get('bbox')
    if bb is not None:
        bb = bb.split(',')
    print('This: ', bb)
    xPrev = x.getprevious()
    bb = None
    if xPrev is not None:
        bb = xPrev.attrib.get('bbox')
        if bb is not None:
            bb = bb.split(',')
    if bb is not None:
        print('  Previous: ', bb)
    else:
        print('  No previous bbox')

It prints the bbox of the current text element and of its immediately preceding sibling (if any).
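As a rough sketch of how the same "look at the previous sibling" idea yields the distance the question asks for, the example below pairs each child of a textline with the one before it. To stay runnable without extra dependencies it uses the standard library's xml.etree.ElementTree, which has no getprevious, so consecutive children are paired with zip instead; with lxml, x.getprevious() would play the same role. The names SAMPLE, first_value and distances_to_previous are made up for this illustration, and the sample is a shortened copy of the question's XML.

```python
import xml.etree.ElementTree as ET

# Shortened sample with the same shape as the XML in the question.
SAMPLE = """<textline>
    <text bbox="191.745,592.218,199.339,603.578">C</text>
    <text bbox="192.745,592.218,199.339,603.578">A</text>
    <text bbox="193.745,592.218,199.339,603.578">P</text>
    <text></text>
    <text bbox="191.745,592.218,199.339,603.578">I</text>
</textline>"""


def first_value(elem):
    """Return the first bbox number of elem as a float, or None if absent."""
    bbox = elem.attrib.get('bbox')
    if not bbox:
        return None
    return float(bbox.split(',')[0])


def distances_to_previous(line):
    """For each child after the first, return (text, distance to previous).

    The distance is None when either element of the pair has no bbox.
    """
    children = list(line)
    result = []
    # zip pairs every element with its immediately preceding sibling.
    for prev, curr in zip(children, children[1:]):
        a, b = first_value(prev), first_value(curr)
        dist = None if a is None or b is None else round(b - a, 3)
        result.append((curr.text, dist))
    return result


line = ET.fromstring(SAMPLE)
for text, dist in distances_to_previous(line):
    print(repr(text), dist)
```

The rounding to three decimals is only there to hide floating-point noise in the subtraction; drop it if the raw difference is needed.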

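Edit

If you prefer, you can also access the bbox attribute of the preceding text element directly by calling x.xpath('preceding-sibling::text[1]/@bbox').

But remember that this function returns a list of the nodes found, and if nothing was found, the list is empty (not None).

So before you use this result, you must:

check the length of the returned list (it should be > 0),
retrieve the first element from this list (the text content of the bbox attribute; in this case the list should contain exactly one element),
split it on , (to get a list of fragments),
check that the first element of this result is not empty,
convert it to float.

After that you can use it, e.g. to compare it with the corresponding value from the current bbox.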
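The steps above can be sketched as one small helper. This is only an illustration: the name first_bbox_value is hypothetical, and the function takes the list that the xpath call returns rather than calling lxml itself, so it can be tried without any XML at hand.

```python
def first_bbox_value(result):
    """Turn the list returned by an XPath attribute query into a float.

    `result` is assumed to be what
    x.xpath('preceding-sibling::text[1]/@bbox') returns in lxml:
    a list holding zero or one attribute strings.
    Returns None when there is no usable value.
    """
    if len(result) == 0:       # 1. nothing found: an empty list, not None
        return None
    bbox = result[0]           # 2. the text content of the bbox attribute
    parts = bbox.split(',')    # 3. split on commas into fragments
    first = parts[0].strip()
    if first == '':            # 4. the first fragment must not be empty
        return None
    return float(first)        # 5. convert to float


print(first_bbox_value([]))
print(first_bbox_value(['191.745,592.218,199.339,603.578']))
```

Inside the loop over text elements it would be called as first_bbox_value(x.xpath('preceding-sibling::text[1]/@bbox')), and a None result means there is no previous bbox to compare against.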

Answer 2

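Python's XML libraries (lxml, through libxml2) implement only the very old XPath 1.0 standard. In XPath 1.0, the "<" operator always converts its operands to numbers. So when you write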

//text[@bbox < preceding::text[1]/@bbox + 11]

you are doing a numeric comparison and a numeric addition on the @bbox values.

But @bbox is not a number; it is a comma-separated list of four numbers:

179.739,592.028,261.007,604.510

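Converting it to a number yields NaN (not a number), and NaN < NaN returns false.

To do anything useful with structured attribute values like this, you really need XPath 2.0 or later.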
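To make the NaN behaviour concrete, here is a small Python sketch that imitates how XPath 1.0 converts a string to a number. It is an approximation written for illustration, not code taken from lxml or libxml2: the helper name xpath1_number is made up, and the regular expression follows the Number grammar of the XPath 1.0 specification (digits with an optional fraction, an optional leading minus sign, surrounding whitespace allowed). Python floats and XPath numbers are both IEEE 754 doubles, so the comparison rule for NaN is the same.

```python
import math
import re

# XPath 1.0 Number grammar: Digits ('.' Digits?)? | '.' Digits,
# optionally preceded by '-' and surrounded by whitespace.
_NUMBER = re.compile(r'^[ \t\r\n]*-?(\d+(\.\d*)?|\.\d+)[ \t\r\n]*$')


def xpath1_number(value):
    """Approximate XPath 1.0 number(): a float, or NaN if not a number."""
    if _NUMBER.match(value):
        return float(value)
    return math.nan


bbox = '179.739,592.028,261.007,604.510'
left = xpath1_number(bbox)         # NaN: the commas make it a non-number
right = xpath1_number(bbox) + 11   # NaN + 11 is still NaN
print(left, right, left < right)   # any comparison involving NaN is False
```

This is why the predicate in the question never matches any element: both sides of the "<" collapse to NaN before the comparison happens. Splitting the attribute in Python and comparing the first fragment avoids the problem.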

Answer 1
1 Accepted 2020-04-15 09:32:45

Answer 2
1 2020-04-15 10:51:59
