How to use preceding sibling for XML with XPath in Python?
I have an XML document structured like this:
<?xml version="1.0" encoding="utf-8"?>
<pages>
<page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
<textbox id="0" bbox="179.739,592.028,261.007,604.510">
<textline bbox="179.739,592.028,261.007,604.510">
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">C</text>
<text font="NUMPTY+ImprintMTnum-it" bbox="192.745,592.218,199.339,603.578" ncolour="0" size="12.333">A</text>
<text font="NUMPTY+ImprintMTnum-it" bbox="193.745,592.218,199.339,603.578" ncolour="0" size="12.333">P</text>
<text font="NUMPTY+ImprintMTnum-it" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.333">I</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">T</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">L</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
<text></text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
<text></text>
</textline>
</textbox>
</page>
</pages>
The bbox attribute of each text element has four values, and I need the difference between the first bbox value of an element and the first bbox value of its preceding element. For example, the distance between the first two bboxes is 1. In the following loop, I need to find the preceding sibling of the element whose bbox value I read, in order to calculate the distance between the two.
def wrap(line, idxList):
    if len(idxList) == 0:
        return  # No elements to wrap
    # Take the first element from the original location
    idx = idxList.pop(0)  # Index of the first element
    elem = removeByIdx(line, idx)  # The indicated element
    # Create "newline" element with "elem" inside
    nElem = E.newline(elem)
    line.insert(idx, nElem)  # Put it in place of "elem"
    while len(idxList) > 0:  # Process the rest of index list
        # Value not used, but must be removed
        idxList.pop(0)
        # Remove the current element from the original location
        currElem = removeByIdx(line, idx + 1)
        nElem.append(currElem)  # Append it to "newline"

for line in root.iter('textline'):
    idxList = []
    for elem in line:
        bbox = elem.attrib.get('bbox')
        if bbox is not None:
            tbl = bbox.split(',')
            distance = float(tbl[2]) - float(tbl[0])
        else:
            distance = 100  # "Too big" value
        if distance > 10:
            par = elem.getparent()
            idx = par.index(elem)
            idxList.append(idx)
        else:  # "Wrong" element, wrap elements "gathered" so far
            wrap(line, idxList)
            idxList = []
    # Process "good" elements without any "bad" after them, if any
    wrap(line, idxList)
#print(etree.tostring(root, encoding='unicode', pretty_print=True))
I tried with XPath like this:
for x in tree.xpath("//text[@bbox<preceding::text[1]/@bbox+11]"):
    print(x)
But it returns nothing. Is my path wrong, and how can I insert it in the loop?
The reason your code failed is that the axis name for preceding siblings is preceding-sibling (not preceding).
But here you don't need XPath expressions at all, as lxml has a native method, getprevious, that returns the (first) preceding sibling.
To check access to the previous text node, try the following loop:
for x in tree.xpath('//text'):
    bb = x.attrib.get('bbox')
    if bb is not None:
        bb = bb.split(',')
    print('This: ', bb)
    xPrev = x.getprevious()
    bb = None
    if xPrev is not None:
        bb = xPrev.attrib.get('bbox')
        if bb is not None:
            bb = bb.split(',')
    if bb is not None:
        print(' Previous: ', bb)
    else:
        print(' No previous bbox')
It prints the bbox of the current text element and of its immediately preceding sibling, if any.
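As a rough sketch of how getprevious could feed the distance calculation from the question, the snippet below computes the difference between the first bbox value of each text element and that of its immediately preceding sibling. The sample XML is a shortened stand-in for the real file, and first_bbox_value and distances_to_previous are hypothetical helper names, not part of lxml.

```python
from lxml import etree

# Shortened stand-in for the real document (an assumption for illustration).
SAMPLE = b"""<textline>
<text bbox="191.745,592.218,199.339,603.578">C</text>
<text bbox="192.745,592.218,199.339,603.578">A</text>
<text bbox="193.745,592.218,199.339,603.578">P</text>
<text></text>
<text bbox="191.745,592.218,199.339,603.578">I</text>
</textline>"""


def first_bbox_value(elem):
    """Return the first bbox number of elem, or None if there is no bbox."""
    if elem is None:
        return None
    bbox = elem.get('bbox')
    if bbox is None:
        return None
    return float(bbox.split(',')[0])


def distances_to_previous(line):
    """Pair each text element's content with its distance to the previous sibling.

    The distance is None when either of the two elements lacks a bbox.
    """
    result = []
    for elem in line.iter('text'):
        curr = first_bbox_value(elem)
        prev = first_bbox_value(elem.getprevious())
        if curr is None or prev is None:
            result.append((elem.text, None))
        else:
            result.append((elem.text, round(curr - prev, 3)))
    return result


line = etree.fromstring(SAMPLE)
for text, dist in distances_to_previous(line):
    print(text, dist)
```

Returning None for a missing bbox, rather than a sentinel such as 100, keeps "no value" separate from "large distance"; the caller can then decide how to treat the empty text elements.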
If you want, you can also access the bbox attribute of the preceding text element directly by calling x.xpath('preceding-sibling::text[1]/@bbox').
But remember that this function returns a list of found nodes, and if nothing is found, this list is empty (not None).
So before you make any use of this result, you must check that the list is not empty, take its first element, and split it on "," (getting a list of fragments). After that you can use it, e.g. compare it with the corresponding value from the current bbox.
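A minimal sketch of handling the list returned by x.xpath('preceding-sibling::text[1]/@bbox') might look like this. The sample XML is a shortened stand-in for the real file, previous_first_bbox and close_to_previous are hypothetical names, and the limit of 11 is taken from the XPath attempt in the question.

```python
from lxml import etree

# Shortened stand-in for the real document (an assumption for illustration).
SAMPLE = b"""<textline>
<text bbox="191.745,592.218,199.339,603.578">C</text>
<text bbox="192.745,592.218,199.339,603.578">A</text>
<text bbox="193.745,592.218,199.339,603.578">P</text>
<text></text>
<text bbox="191.745,592.218,199.339,603.578">I</text>
</textline>"""


def previous_first_bbox(elem):
    """Return the first bbox number of the preceding text sibling, or None."""
    found = elem.xpath('preceding-sibling::text[1]/@bbox')
    if not found:  # Empty list: no preceding text, or it has no bbox
        return None
    fragments = found[0].split(',')  # List of four string fragments
    return float(fragments[0])


def close_to_previous(line, limit=11):
    """Collect the content of text elements whose first bbox value is below
    the preceding sibling's first bbox value plus limit."""
    selected = []
    for elem in line.iter('text'):
        bbox = elem.get('bbox')
        if bbox is None:
            continue  # Nothing to compare for elements without a bbox
        curr = float(bbox.split(',')[0])
        prev = previous_first_bbox(elem)
        if prev is not None and curr < prev + limit:
            selected.append(elem.text)
    return selected


line = etree.fromstring(SAMPLE)
print(close_to_previous(line))
```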
Python's lxml supports only the very old XPath 1.0 standard. In XPath 1.0, the "<" operator always converts its operands to numbers. So when you write
//text[@bbox < preceding::text[1]/@bbox + 11]
you are performing a numeric comparison and a numeric addition on @bbox values. But @bbox is not a number; it is a comma-separated list of four numbers:
179.739,592.028,261.007,604.510
Converting that to a number produces NaN (not-a-number), and NaN < NaN returns false.
To do anything useful with a structured attribute value like this, you really need XPath 2.0 or later.
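The NaN behaviour can be observed directly from lxml, as in the sketch below. It also shows one possible XPath 1.0 workaround for the narrow case of the first value only: substring-before isolates the text before the first comma, which does convert to a number. The sample XML is a shortened stand-in for the real file, and the workaround expression is an illustration rather than something the original posts propose; anything beyond the first value quickly becomes unwieldy in XPath 1.0.

```python
import math

from lxml import etree

# Shortened stand-in for the real document (an assumption for illustration).
SAMPLE = b"""<textline>
<text bbox="191.745,592.218,199.339,603.578">C</text>
<text bbox="192.745,592.218,199.339,603.578">A</text>
<text bbox="193.745,592.218,199.339,603.578">P</text>
<text></text>
<text bbox="191.745,592.218,199.339,603.578">I</text>
</textline>"""

line = etree.fromstring(SAMPLE)

# Converting the whole attribute to a number fails because of the commas.
whole = line.xpath('number(text[1]/@bbox)')
print(math.isnan(whole))

# substring-before() keeps only the part before the first comma.
first = line.xpath("number(substring-before(text[1]/@bbox, ','))")
print(first)

# The comparison from the question, restricted to the first bbox value.
# Elements without a bbox, or without a preceding bbox, yield NaN and are skipped.
expr = (
    "//text[number(substring-before(@bbox, ',')) < "
    "number(substring-before(preceding-sibling::text[1]/@bbox, ',')) + 11]"
)
matches = [t.text for t in line.xpath(expr)]
print(matches)
```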