
What is the fastest way to extract content from an XML document using LXML?

I'm using LXML to extract information from a bunch of XML files, and I'm wondering whether the way I'm approaching this task is the most efficient. Right now I use the xpath() method in lxml to identify the specific targets and then use different lxml methods to extract the information.

As I noted in an earlier question ( Processing of XML files excruciatingly slow with LXML Python ), using etree.parse(file) or etree.parse(file).getroot() becomes very slow once the files reach a certain size. They don't need to be very big: a 12 MB XML file is already pretty slow.

What I'm wondering now is whether there is some alternative that might be faster. The LXML documentation says that using the XPath class might be faster than using the xpath() method. The problem I'm having is that the XPath class works with Element objects, not with the ElementTree objects that etree.parse() produces.
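(For reference, a compiled etree.XPath object can be called on the Element returned by getroot(), so the two can be combined. A minimal sketch, with made-up tok data modelled on the samples below:)

```python
from lxml import etree

xml = b"""<text>
  <tok id="w-1" lemma="el" xpos="L3MSA">lo</tok>
  <tok id="w-2" lemma="haver" xpos="VMIP1S0">hac</tok>
</text>"""

# An Element, just like etree.parse(file).getroot() would give you.
root = etree.fromstring(xml)

# Compiled once, reusable across many calls and many documents.
find_toks = etree.XPath("//tok[@lemma='el']")
print([el.text for el in find_toks(root)])
```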

All I need is some faster alternative to what I'm doing now, which is basically some variation of what follows. This is just one example of the many scripts of the same kind I use to extract information from the relevant XML files. In case you think the regular expressions are responsible for the slowness: I've run tests using the XPath root_element.xpath('//tok[text()="lo"]') with no regex, and the time it takes to process the 20 to 30 MB files is only slightly lower. Whatever I do with these files, if it involves a for loop that checks an XPath expression and does something, it takes far longer than one would expect on the latest Python and a Mac with the M1 Max chip. On my older laptop the same task takes three days!

import csv
import os
from lxml import etree as et

XMLDIR = "/path_to_dir_with_xml_files"
myCSV_FILE = "/path_to_some_csv_file.csv"

ext = ".xml"


def xml_extract(root_element):

    for el in root_element.xpath('//tok[re:match(., "^[EeLl][LlOoAa][Ss]*$") and not(starts-with(@xpos, "D"))]',
        namespaces={"re": "http://exslt.org/regular-expressions"}): 

        target = el.text
        # allRelevantElements = el.xpath('preceding::tok[position() >= 1 and not(position() > 6)]/following::tok[position() >= 1 and not(position() > 6)]')
        RelevantPrecedingElements = el.xpath(
            "preceding::tok[position() >= 1 and not(position() > 6)]"
        )
        RelevantFollowingElements = el.xpath(
            "following::tok[position() >= 1 and not(position() > 6)]"
        )
        context_list = []

        for elem in RelevantPrecedingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            context_list.append(elem_text)

        # adjective = '<' + str(el.text) + '>'
        target = f"<{el.text}>"
        print(target)
        context_list.append(target)

        following_context = []
        for elem in RelevantFollowingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            following_context.append(elem_text)

        lema_fol = el.xpath('following::tok[1]')[0].get('lemma') if el.xpath('following::tok[1]') else None
        lema_prec = el.xpath('preceding::tok[1]')[0].get('lemma') if el.xpath('preceding::tok[1]') else None
        xpos_fol = el.xpath('following::tok[1]')[0].get('xpos') if el.xpath('following::tok[1]') else None
        xpos_prec = el.xpath('preceding::tok[1]')[0].get('xpos') if el.xpath('preceding::tok[1]') else None
        form_fol = el.xpath('following::tok[1]')[0].text if el.xpath('following::tok[1]') else None
        form_prec = el.xpath('preceding::tok[1]')[0].text if el.xpath('preceding::tok[1]') else None

        context = " ".join(context_list)
        print(f"Context is: {context}")


        llista = [
            context,
            lema_prec,
            xpos_prec,
            form_prec,
            target,
            lema_fol,
            xpos_fol,
            form_fol,
        ]

        writer = csv.writer(csv_file, delimiter=";")
        writer.writerow(llista)

with open(myCSV_FILE, "a+", encoding="UTF8", newline="") as csv_file:

    for root, dirs, files in os.walk(XMLDIR):

        for file in files:
            if file.endswith(ext):
                file_path = os.path.join(root, file)  # use the walked directory, not XMLDIR
                file_root = et.parse(file_path).getroot()
                doc = file
                xml_extract(file_root)

Here's a sample from an XML document containing a match for the XPath expression I'm using. The function 'xml_extract' is called on this match, and the different pieces of information are correctly extracted and stored in the CSV file. This works fine and does what I want, but it is way too slow.

<tok id="w-6387" ord="24" lemma="per" xpos="SPS00">per</tok>
<tok id="w-6388" ord="25" lemma="algun" xpos="DI0FP0">algunes</tok>
<tok id="w-6389" ord="26" lemma="franquesa" xpos="NCFP000">franqueses</tok>
<tok id="w-6390" nform="el" ord="27" lemma="el" xpos="L3MSA">lo</tok>
<tok id="w-6391" ord="28" lemma="haver" xpos="VMIP1S0">hac</tok>

EDIT:

To give some additional relevant information that might help those trying to assist me: the preceding XML content is rather straightforward, but the structure of the documents can get rather complicated at times. I'm doing a study on medieval texts, and the XML tags in these texts can contain different kinds of information. The 'tok' tags contain linguistic annotations, which are the ones I'm interested in. In normal circumstances the XML looks like the preceding sample. In some cases, however, the editors included other tags with metadata about the manuscripts (e.g. whether there was a modification or deletion by a scribe, whether there is a new section or a new page, the title of a section, etc.). This should give you a sense of what can be found and perhaps help you understand why I'm using this approach. Most of the metadata is not relevant for me at this stage.

What is relevant is the information contained in the 'dtok' tags. These are children of 'tok', used whenever contracted forms have to be decomposed into independent words. This system allows contractions to be visualized as single words while providing linguistic information about their components. The tagging was done automatically, but it is full of errors. One of my goals in extracting this information is to detect patterns that might help us improve the linguistic annotation in a semi-automated way.
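(To illustrate the 'dtok' decomposition described above, here is a small sketch using the stdlib ElementTree, with the contracted form "dar-los" taken from the sample below:)

```python
import xml.etree.ElementTree as ET

xml = """<text>
  <tok id="w-99771">dar-los
    <dtok form="dar" id="d-99771-1" ord="25" lemma="dar" xpos="VMN0000" />
    <dtok form="los" id="d-99771-2" ord="26" lemma="els" xpos="L3CP0" />
  </tok>
</text>"""

root = ET.fromstring(xml)
for tok in root.iter("tok"):
    dtoks = tok.findall("dtok")
    if dtoks:  # a contracted form decomposed into its components
        parts = [(d.get("form"), d.get("lemma"), d.get("xpos")) for d in dtoks]
        print(tok.text.strip(), "->", parts)
```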

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE document SYSTEM "estcorpus.dtd">
<TEI title="Full title" name="Doc_I44">
  <!--this file comes from the Stand-by folder: it needs to be checked because it has inaccurate xml tags-->
  <header>
    <filiation type="obra">Book title</filiation>
    <filiation type="autor">Author name</filiation>
    <filiation type="data">Segle XVIIa</filiation>
    <filiation type="tipologia">Letters</filiation>
    <filiation type="dialecte">Oc:V</filiation>
  </header>
  <text section="logic" lang="català" analyse="norest">
    <pb n="1r" type="folio" id="e-1" />
    <space />
    <space />
    <mark name="empty line" />
    <add>
      <tok form="IX" id="w-384" ord="1" lemma="IX" xpos="Z">IX</tok>
    </add>
    <mark name="lang=Latin" />
    <tok id="w-385" ord="2" lemma="morir" xpos="TMMS">Mort</tok>
    <tok id="w-386" ord="3" lemma="de" xpos="SPC00">de</tok>
    <tok id="w-387" ord="4" lemma="sant" xpos="NCMS000">sent</tok>
    <tok id="w-388" ord="5" lemma="Vicent" xpos="NP00000">Vicent</tok>
    <tok id="w-389" ord="6" lemma="Ferrer" xpos="NPCS00">Ferrer</tok>
    <tok id="w-99769" ord="23" xpos="CC" lemma="i">e</tok>
    <tok id="w-99770" ord="24" lemma="jo" xpos="PP1CSN00">jo</tok>
    <tok id="w-99771">dar-los 
    <dtok form="dar" id="d-99771-1" ord="25" lemma="dar" xpos="VMN0000" />
    <dtok form="los" id="d-99771-2" ord="26" lemma="els" xpos="L3CP0" /></tok>
    <tok id="w-99772" ord="27" lemma="haver" xpos="V0IF3S0">hé</tok>
    <tok id="w-99773" ord="28" lemma="diner" xpos="NCMP000">diners</tok>
    <space />
    <mark name="/lang" />
    <foreign name="Latin">
      <tok id="w-390" ord="7" lemma="any" xpos="CC">Annum</tok>
    </foreign>
  </text>
</TEI>

Is this the only way to go through an XML file using LXML, or is there a faster way? Right now it takes 21 minutes to go through a 30 MB file and extract the information associated with the specific XPath expression. I'm using Python 3.11 and a pretty powerful computer. I can't help thinking there must be a more efficient way to do this. I have around 400 files in the directory, and it takes forever every time I have to go through them and do something.

If you want to give xml.etree.ElementTree a try, you can do something like this:

import timeit

import xml.etree.ElementTree as ET
import pandas as pd

def main(file_path):
    tree = ET.parse(file_path)
    root = tree.getroot()

    columns = ['id', 'nform', 'ord', 'lemma', 'xpos', 'TAG_text']
    row = []
    for elem in root.iter('tok'):
        hit = (elem.get('id'),elem.get('nform'), elem.get('ord'), elem.get('lemma'), elem.get('xpos'), elem.text)
        row.append(hit)
        #print(elem.get('id'),elem.get('nform'), elem.get('ord'), elem.get('lemma'), elem.get('xpos'), elem.get('hac'), elem.text)
        
    df = pd.DataFrame(row, columns=columns)
    print(df)

if __name__ == '__main__':
    """ Input XML file definition """
    starttime=timeit.default_timer()

    file_path = 'LXML_dummy.xml'
    main(file_path)
    print()
    print('Finished')
    print("Runtime:", timeit.default_timer()-starttime)

Output:

       id nform ord      lemma     xpos    TAG_text
0  w-6387  None  24        per    SPS00         per
1  w-6388  None  25      algun   DI0FP0     algunes
2  w-6389  None  26  franquesa  NCFP000  franqueses
3  w-6390    el  27         el    L3MSA          lo
4  w-6391  None  28      haver  VMIP1S0         hac

Finished
Runtime: 0.008726199999728124

You can also try ET.iterparse() or ET.XMLPullParser() if you have really huge XML files (>1 GB), depending on your machine (see the docs):

Example to change:

def main(file_path):
    
    events =('start','end','start-ns','end-ns')
    columns = ['id', 'nform', 'ord', 'lemma', 'xpos', 'TAG_text']

    row = []
    #ns ={"http://exslt.org/regular-expressions"}
    for event, elem in ET.iterparse(file_path, events=events):
        if event == 'end' and elem.tag == 'tok':
            hit = (elem.get('id'),elem.get('nform'), elem.get('ord'), elem.get('lemma'), elem.get('xpos'), elem.text)
            row.append(hit)
            #print(elem.get('id'),elem.get('nform'), elem.get('ord'), elem.get('lemma'), elem.get('xpos'), elem.get('hac'), elem.text)
        
    df = pd.DataFrame(row, columns=columns)
    print(df)

Based on the suggestion to compile each XPath expression once, and on my comment that you currently run those preceding:: and following:: XPath evaluations several times instead of once, I would try to use

import csv
import os
from lxml import etree as et

match_xpath = et.XPath('//tok[re:match(., "^[EeLl][LlOoAa][Ss]*$") and not(starts-with(@xpos, "D"))]',
        namespaces={"re": "http://exslt.org/regular-expressions"})

preceding_xpath = et.XPath('preceding::tok[position() >= 1 and not(position() > 6)]')

following_xpath = et.XPath('following::tok[position() >= 1 and not(position() > 6)]')

def xml_extract(root_element):

    for el in match_xpath(root_element):

        target = el.text
        # allRelevantElements = el.xpath('preceding::tok[position() >= 1 and not(position() > 6)]/following::tok[position() >= 1 and not(position() > 6)]')
        RelevantPrecedingElements = preceding_xpath(el)
        prec1 = RelevantPrecedingElements[-1] if RelevantPrecedingElements else None
        RelevantFollowingElements = following_xpath(el)
        foll1 = RelevantFollowingElements[0] if RelevantFollowingElements else None
        context_list = []

        for elem in RelevantPrecedingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            context_list.append(elem_text)

        # adjective = '<' + str(el.text) + '>'
        target = f"<{el.text}>"
        print(target)
        context_list.append(target)

        following_context = []
        for elem in RelevantFollowingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            following_context.append(elem_text)

        lema_fol = foll1.get('lemma') if foll1 is not None else None
        lema_prec = prec1.get('lemma') if prec1 is not None else None
        xpos_fol = foll1.get('xpos') if foll1 is not None else None
        xpos_prec = prec1.get('xpos') if prec1 is not None else None
        form_fol = foll1.text if foll1 is not None else None
        form_prec = prec1.text if prec1 is not None else None

        context = " ".join(context_list)
        print(f"Context is: {context}")


        llista = [
            context,
            lema_prec,
            xpos_prec,
            form_prec,
            target,
            lema_fol,
            xpos_fol,
            form_fol,
        ]

        writer = csv.writer(csv_file, delimiter=";")
        writer.writerow(llista)

with open(myCSV_FILE, "a+", encoding="UTF8", newline="") as csv_file:

    for root, dirs, files in os.walk(XMLDIR):
        for file in files:
            if file.endswith(ext):
                file_path = os.path.join(root, file)  # use the walked directory, not XMLDIR
                file_root = et.parse(file_path).getroot()
                doc = file
                xml_extract(file_root)

and check whether that improves performance.
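To gauge the difference, here is a rough micro-benchmark sketch on synthetic data (the tok/lemma values are invented); both variants should return the same node set, so only the timing differs:

```python
import timeit
from lxml import etree

# Build a small synthetic tree with 2000 tok elements, 200 of which match.
root = etree.Element("text")
for i in range(2000):
    tok = etree.SubElement(root, "tok", lemma="el" if i % 10 == 0 else "altre")
    tok.text = "lo" if i % 10 == 0 else "x"

compiled = etree.XPath("//tok[@lemma='el']")

t_compiled = timeit.timeit(lambda: compiled(root), number=100)
t_adhoc = timeit.timeit(lambda: root.xpath("//tok[@lemma='el']"), number=100)

# Sanity check: both approaches select the same elements.
assert compiled(root) == root.xpath("//tok[@lemma='el']")
print(f"compiled XPath: {t_compiled:.3f}s, ad-hoc xpath(): {t_adhoc:.3f}s")
```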

As an alternative, it would be interesting to see, first for a single file, how SaxonC ( https://saxonica.com/saxon-c/1199/ ) with XSLT 3 performs with an XSLT stylesheet like

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0">
  
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:variable name="toks" select="//tok"/>
    <xsl:variable name="toks-id" select="$toks/generate-id()"/>
    <xsl:for-each select="$toks[matches(., '^[EeLl][LlOoAa][Ss]*$') and not(starts-with(@xpos, 'D'))]">
      <xsl:variable name="pos" select="index-of($toks-id, generate-id())"/>
      <xsl:variable name="prec-tok" select="$toks[$pos - 1]"/>
      <xsl:variable name="foll-tok" select="$toks[$pos + 1]"/>
      <xsl:value-of 
        select="string-join($toks[position() = ($pos - 5) to ($pos - 1)], ''), 
                $prec-tok/@lemma,
                $prec-tok/@xpos,
                $prec-tok,
                '&lt;' || . || '>',
                $foll-tok/@lemma,
                $foll-tok/@xpos,
                $foll-tok" 
                separator=";"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
  
</xsl:stylesheet>

Of course, XSLT 3 with Saxon can also easily process all .xml files in a directory to produce a single output file (e.g. a CSV):

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:param name="csv-path" select="'/path_to_some_csv_file.csv'"/>

  <xsl:param name="directory-uri" as="xs:string" select="'file://' || '/path_to_dir_with_xml_files'"/>

  <xsl:param name="ext" as="xs:string" select="'*.xml'"/>

  <xsl:template name="xsl:initial-template">
    <xsl:result-document href="file://{$csv-path}">
      <xsl:apply-templates select="uri-collection($directory-uri || '?select=' || $ext)!doc(.)"/>
    </xsl:result-document>
  </xsl:template>
  
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:variable name="toks" select="//tok"/>
    <xsl:variable name="toks-id" select="$toks/generate-id()"/>
    <xsl:for-each select="$toks[matches(., '^[EeLl][LlOoAa][Ss]*$') and not(starts-with(@xpos, 'D'))]">
      <xsl:variable name="pos" select="index-of($toks-id, generate-id())"/>
      <xsl:variable name="prec-tok" select="$toks[$pos - 1]"/>
      <xsl:variable name="foll-tok" select="$toks[$pos + 1]"/>
      <xsl:value-of 
        select="string-join($toks[position() = ($pos - 5) to ($pos - 1)], ''), 
                $prec-tok/@lemma,
                $prec-tok/@xpos,
                $prec-tok,
                '&lt;' || . || '>',
                $foll-tok/@lemma,
                $foll-tok/@xpos,
                $foll-tok" 
                separator=";"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
  
</xsl:stylesheet>

That last example is untested.
