
Is there a way using lxml.etree to skip the first entry or start the iteration at a specific child when parsing an XML file?

I am currently using the .iter method in the lxml.etree package for Python to parse an XML file. Is there a way to skip the first entry or start the iteration at a specific child using something like XPath?

I've looked into the itertext and iterparse methods, but based on their descriptions I'm not sure they do much more than narrow the iteration down to specific tags, which I've already done.

import lxml.etree as et

parsedXML = et.parse(file_path)

for child in parsedXML.iter('{http://www.witsml.org/schemas/131}data'):
    ...  # per-row processing (omitted in the question)

The code parses the XML file successfully, but I'd like to save time by skipping rows (all of which are comma-delimited) that are empty or too short to contain real data.

<logData>
<data>63653079886,,,,,,,,,,,,,,,,,,,,,,,</data>
<data>63653079887,,,,,,,,,,,,,,,,,,,,,,,</data>
<data>63653079888,,,,,,,,,,,,,,,,,,,,,,,</data>
<data>63653079889,,,,,,,,,,,,,,,,,,,,,,,</data>
<data>63653079890,,29.3,155.8,12.25,0.0,0,0,95.31,-86.11,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
<data>63653079891,,29.3,155.7,12.25,0.0,0,0,95.31,-86.11,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
<data>63653079892,,29.3,155.8,12.25,0.0,0,0,93.76,-87.65,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>

There are many rows that are empty except for the 11-digit value at the start of each one. I'd like to jump past those and start the iteration at the first row that has the 12.25 value in this case (the 5th row in the example).

Since the data elements that contain only the 11-digit value and the commas (without any whitespace) are 34 characters long, you can test the string length in a predicate:

data[string-length(translate(.,' ','')) > 34]

I used translate() to remove any whitespace before checking the string length.
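As a quick sanity check of where 34 comes from (a sketch only; the variable names are illustrative): each row in the sample has 24 fields, so 23 commas, plus the 11-digit leading value. If your rows have a different number of columns, the threshold changes accordingly.

```python
# An "empty" row: the 11-digit leading value followed by 23 commas (24 fields).
row = "63653079886" + "," * 23
threshold = len(row)

# The same row with a space after most commas, as in the example input below.
spaced = "63653079889" + ", " * 22 + ","

# Removing spaces (what translate(., ' ', '') does in the XPath) restores the same length.
stripped_length = len(spaced.replace(" ", ""))

print(threshold, stripped_length)
```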

Example...

XML Input (input.xml)

<logData>
    <data>63653079886,,,,,,,,,,,,,,,,,,,,,,,</data>
    <data>63653079887,,,,,,,,,,,,,,,,,,,,,,,</data>
    <data>63653079888,,,,,,,,,,,,,,,,,,,,,,,</data>
    <data>63653079889,,,,,,,,,,,,,,,,,,,,,,,</data>
    <data>63653079889, , , , , , , , , , , , , , , , , , , , , , ,</data>
    <data>63653079890,,29.3,155.8,12.25,0.0,0,0,95.31,-86.11,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
    <data>63653079891,,29.3,155.7,12.25,0.0,0,0,95.31,-86.11,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
    <data>63653079892,,29.3,155.8,12.25,0.0,0,0,93.76,-87.65,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
</logData>

Python (I used XMLParser() to make the printed output nicer. It's not strictly necessary.)

from lxml import etree

parser = etree.XMLParser(remove_blank_text=True)

tree = etree.parse("input.xml", parser=parser)

for data in tree.xpath("data[string-length(translate(.,' ','')) > 34]"):
    print(etree.tostring(data).decode())

Output (printed to console)

<data>63653079890,,29.3,155.8,12.25,0.0,0,0,95.31,-86.11,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
<data>63653079891,,29.3,155.7,12.25,0.0,0,0,95.31,-86.11,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
<data>63653079892,,29.3,155.8,12.25,0.0,0,0,93.76,-87.65,1729654,1202864,1319105,1.00,1.00,-511.4,1.95,74,0,0,264.1,3.4,,356.9</data>
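Note that the question iterates over elements in the http://www.witsml.org/schemas/131 namespace, while the sample input here has no namespace. If the real file declares that namespace, an unprefixed data step in XPath will not match. A minimal sketch of the same predicate with a namespace mapping follows; the "w" prefix and the inline sample document are assumptions made for illustration.

```python
from lxml import etree

# Namespace URI taken from the question; the "w" prefix is an arbitrary local choice.
NS = {"w": "http://www.witsml.org/schemas/131"}

# Small inline sample standing in for the real file (assumed structure).
empty_row = "63653079889" + "," * 23
full_row = "63653079890,,29.3,155.8,12.25,0.0,0,0,95.31,-86.11"
xml = (
    '<logData xmlns="http://www.witsml.org/schemas/131">'
    f"<data>{empty_row}</data>"
    f"<data>{full_row}</data>"
    "</logData>"
)

root = etree.fromstring(xml)

# Same length test as above, with the element name qualified by the prefix.
rows = root.xpath(
    "w:data[string-length(translate(., ' ', '')) > 34]",
    namespaces=NS,
)

for data in rows:
    print(data.text)
```

XPath 1.0 has no notion of a default namespace, so the prefix is needed in the expression even though the document itself does not use one.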

If you really want to test for the 12.25 value itself, it gets a little messy in an XPath 1.0 predicate because the string lengths of the values before it are unknown. You could do it with a series of nested substring-after() calls inside a substring-before(). It's not pretty though...

xpath("data[substring-before(substring-after(substring-after(substring-after(substring-after(translate(.,' ',''),','),','),','),','),',') = '12.25']")
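An alternative to the nested XPath, and not part of the answer above, is to do the field test in Python and use itertools.dropwhile to start the iteration at the first matching row. This is only a sketch: the helper name has_value, its parameters, and the shortened sample rows are hypothetical.

```python
from itertools import dropwhile

from lxml import etree


def has_value(elem, index=4, expected="12.25"):
    """Return True if the comma-separated field at `index` equals `expected`."""
    fields = [f.strip() for f in (elem.text or "").split(",")]
    return len(fields) > index and fields[index] == expected


# Shortened sample rows (assumed data, modelled on the question).
empty = "63653079889" + "," * 23
full = "63653079890,,29.3,155.8,12.25,0.0,0,0,95.31"
root = etree.fromstring(
    f"<logData><data>{empty}</data><data>{full}</data><data>{empty}</data></logData>"
)

# dropwhile skips elements until the first match, then yields everything
# after it, including any later rows that do not match.
started = list(dropwhile(lambda e: not has_value(e), root.iter("data")))

for data in started:
    print(data.text)
```

Unlike the XPath predicate, which filters every row, this starts at the first row with the value and then stops testing. Every element before that row is still visited, so the saving is in per-row processing rather than in parsing.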
