I have an XML structured like this:
<?xml version="1.0" encoding="utf-8"?>
<pages>
<page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
<textbox id="0" bbox="179.739,592.028,261.007,604.510">
<textline bbox="179.739,592.028,261.007,604.510">
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">C</text>
<text font="NUMPTY+ImprintMTnum-it" bbox="192.745,592.218,199.339,603.578" ncolour="0" size="12.333">A</text>
<text font="NUMPTY+ImprintMTnum-it" bbox="193.745,592.218,199.339,603.578" ncolour="0" size="12.333">P</text>
<text font="NUMPTY+ImprintMTnum-it" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.333">I</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">T</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">L</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">O</text>
<text></text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
<text font="NUMPTY+ImprintMTnum" bbox="191.745,592.218,199.339,603.578" ncolour="0" size="12.482">I</text>
<text></text>
</textline>
</textbox>
</page>
</pages>
The bbox attribute of each text tag holds four values, and I need the difference between the first bbox value of an element and the first bbox value of its preceding sibling. For example, the distance between the first two bboxes is 1. In the following loop, I need to find the preceding sibling of the element whose bbox I am reading, so that I can calculate the distance between the two.
def wrap(line, idxList):
    if len(idxList) == 0:
        return  # No elements to wrap
    # Take the first element from the original location
    idx = idxList.pop(0)  # Index of the first element
    elem = removeByIdx(line, idx)  # The indicated element
    # Create "newline" element with "elem" inside
    nElem = E.newline(elem)
    line.insert(idx, nElem)  # Put it in place of "elem"
    while len(idxList) > 0:  # Process the rest of index list
        # Value not used, but must be removed
        idxList.pop(0)
        # Remove the current element from the original location
        currElem = removeByIdx(line, idx + 1)
        nElem.append(currElem)  # Append it to "newline"

for line in root.iter('textline'):
    idxList = []
    for elem in line:
        bbox = elem.attrib.get('bbox')
        if bbox is not None:
            tbl = bbox.split(',')
            distance = float(tbl[2]) - float(tbl[0])
        else:
            distance = 100  # "Too big" value
        if distance > 10:
            par = elem.getparent()
            idx = par.index(elem)
            idxList.append(idx)
        else:  # "Wrong" element, wrap elements "gathered" so far
            wrap(line, idxList)
            idxList = []
    # Process "good" elements without any "bad" after them, if any
    wrap(line, idxList)
#print(etree.tostring(root, encoding='unicode', pretty_print=True))
I tried with XPath like this:
for x in tree.xpath("//text[@bbox<preceding::text[1]/@bbox+11]"):
    print(x)
But it returns nothing. Is my path wrong, and how can I use it in the loop?
One problem with your expression is the axis name: the axis for preceding siblings is preceding-sibling (not preceding, which also reaches text elements in earlier text lines).
But here you don't need XPath expressions at all, as lxml has a native method, getprevious, that returns the immediately preceding sibling.
To check that you can reach the previous text element, try the following loop:
for x in tree.xpath('//text'):
    bb = x.attrib.get('bbox')
    if bb is not None:
        bb = bb.split(',')
    print('This: ', bb)
    xPrev = x.getprevious()
    bb = None
    if xPrev is not None:
        bb = xPrev.attrib.get('bbox')
        if bb is not None:
            bb = bb.split(',')
    if bb is not None:
        print('  Previous: ', bb)
    else:
        print('  No previous bbox')
It prints bbox for the current text element and for the immediately preceding sibling if any.
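Building on that loop, the distance itself could be computed roughly as follows. This is only a sketch: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, and since ElementTree has no getprevious, the previous sibling is simply remembered while iterating. The names SAMPLE, first_coord and gaps are made up for this example, and the sample is a shortened copy of the XML from the question.

```python
import xml.etree.ElementTree as ET

# Shortened sample based on the question's XML (assumption: one textline only)
SAMPLE = """<textline>
<text bbox="191.745,592.218,199.339,603.578">C</text>
<text bbox="192.745,592.218,199.339,603.578">A</text>
<text bbox="193.745,592.218,199.339,603.578">P</text>
<text></text>
<text bbox="191.745,592.218,199.339,603.578">I</text>
</textline>"""


def first_coord(elem):
    """Return the first bbox value of elem as a float, or None if it has no bbox."""
    bbox = elem.get('bbox')
    if bbox is None:
        return None
    return float(bbox.split(',')[0])


def gaps(line):
    """Yield (text, distance to the preceding sibling) for each child of line.

    The distance is None for the first child or when either bbox is missing.
    """
    prev = None
    for elem in line:
        cur_x = first_coord(elem)
        prev_x = first_coord(prev) if prev is not None else None
        if cur_x is None or prev_x is None:
            distance = None
        else:
            distance = round(cur_x - prev_x, 3)  # rounding hides float noise
        yield elem.text, distance
        prev = elem


line = ET.fromstring(SAMPLE)
result = list(gaps(line))
for char, distance in result:
    print(char, distance)
```

With lxml the same idea should work by replacing the remembered prev with elem.getprevious().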
If you want, you can also access the bbox attribute of the preceding text element directly, by calling x.xpath('preceding-sibling::text[1]/@bbox').
But remember that this function returns a list of found nodes, and if nothing has been found, this list is empty (not None).
So before you make any use of this result, you must check that the list is not empty, take its first element, and split it on "," (getting a list of fragments). After that you can convert the fragment you need to a float and use it, e.g. compare it with the corresponding value from the current bbox.
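Those steps could be wrapped in a small helper like the one below. It is a sketch only: first_bbox_value is a hypothetical name, and the lists passed in here are hand-written stand-ins for what the xpath call returns.

```python
def first_bbox_value(xpath_result):
    """Turn the list returned by an '@bbox' XPath query into a float.

    Returns None when the list is empty (no preceding element, or no bbox on it).
    """
    if not xpath_result:  # empty list: nothing was found
        return None
    fragments = xpath_result[0].split(',')  # first match, split into four strings
    return float(fragments[0])  # first coordinate as a number


# Hand-written lists shaped like the result of
# x.xpath('preceding-sibling::text[1]/@bbox')
found = first_bbox_value(['191.745,592.218,199.339,603.578'])
missing = first_bbox_value([])
print(found, missing)
```

Inside the loop it would be called as first_bbox_value(x.xpath('preceding-sibling::text[1]/@bbox')), and the caller would skip the comparison whenever None comes back.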
Python's XML libraries, lxml included, implement only the very old XPath 1.0 standard. In XPath 1.0, the "<" operator always converts its operands to numbers. So when you write
//text[@bbox < preceding::text[1]/@bbox + 11]
you are performing a numeric comparison and a numeric addition on @bbox values.
But @bbox is not a number; it is a comma-separated list of four numbers:
179.739,592.028,261.007,604.510
Converting that to a number produces NaN (not-a-number), and NaN < NaN returns false.
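As a rough illustration of that conversion rule, the sketch below mimics XPath 1.0's number() conversion in plain Python. The function name xpath1_number and the regular expression are assumptions written for this example; they are not part of lxml or of any XPath library.

```python
import math
import re

# XPath 1.0 number syntax: optional whitespace, optional minus sign,
# digits with an optional fraction (or a bare fraction), optional whitespace
_XPATH_NUMBER = re.compile(r'^\s*-?(\d+(\.\d*)?|\.\d+)\s*$')


def xpath1_number(value):
    """Approximate XPath 1.0 number(): strings that are not numbers become NaN."""
    if _XPATH_NUMBER.match(value):
        return float(value)
    return math.nan


current = xpath1_number('192.745,592.218,199.339,603.578')
previous = xpath1_number('191.745,592.218,199.339,603.578')
comparison = current < previous + 11  # NaN < NaN + 11
print(current, previous, comparison)
```

Both values come out as NaN, so the comparison is false for every text element and the predicate selects nothing.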
To do anything useful with a structured attribute value like this, you really need XPath 2.0 or later.