
Rewrite xml and save context

I have XML strings like the following:

xml = """
<body>
    <head>1. Un livre sur <persName type="author" key="Ronsard, Pierre de (1524-1585)" ref="http://www.idref.fr/027107957">Ronsard</persName></head>
    <head>2. <title>La pitié des églises</title> par <persName key="Barrès, Maurice (1862-1923)" ref="http://www.idref.fr/026706601" type="author">Barrès</persName></head>
</body>
"""

I have a function called processLine(line) that takes a whole line (the text within <head>, without tags). In my case, these two lines will be passed to processLine:

1. Un livre sur Ronsard
2. La pitié des églises par Barrès

The function appends a certain string to some words of each line, for example:

"Ronsard" becomes "Ronsard I-PER"
"Barrès"  becomes "Barrès I-PER"

Here is the code I have so far, using the etree module from lxml:

from lxml import etree

root = etree.fromstring(xml)
pars = root.xpath('//body//head')

for par in pars:
    # par.text alone stops at the first child tag, so join every text node instead
    line = ''.join(par.itertext())  # the whole line, stripped of tags
    processLine(line)

My question: how can I save those changes in the XML file without losing its structure?

That is, the new XML in my example would become:

newxml = """
<body>
    <head>1. Un livre sur <persName type="author" key="Ronsard, Pierre de (1524-1585)" ref="http://www.idref.fr/027107957">Ronsard I-PER</persName></head>
    <head>2. <title>La pitié des églises</title> par <persName key="Barrès, Maurice (1862-1923)" ref="http://www.idref.fr/026706601" type="author">Barrès I-PER</persName></head>
</body>
"""

You can set each element's text property to what you need and then call etree.tostring(root, pretty_print=True).

Note that I'm selecting all the <persName> tags, not the headings themselves:

pars = root.xpath('//body//head//persName')

Check this out:

from lxml import etree

xml = """
<body>
    <head>1. Un livre sur <persName type="author" key="Ronsard, Pierre de (1524-1585)" ref="http://www.idref.fr/027107957">Ronsard</persName></head>
    <head>2. <title>La pitié des églises</title> par <persName key="Barrès, Maurice (1862-1923)" ref="http://www.idref.fr/026706601" type="author">Barrès</persName></head>
</body>
"""

root = etree.fromstring(xml)
pars = root.xpath('//body//head//persName')

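for par in pars:
    line = par.text  # text of the <persName> element, e.g. "Ronsard"
    processLine(line)  # processLine is the asker's function; it is not defined here

    par.text = par.text + ' I-PER'

# encoding='unicode' makes tostring return a str instead of bytes
print(etree.tostring(root, encoding='unicode', pretty_print=True))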

This prints the following XML:

<body>
    <head>1. Un livre sur <persName type="author" key="Ronsard, Pierre de (1524-1585)" ref="http://www.idref.fr/027107957">Ronsard I-PER</persName></head>
    <head>2. <title>La pitié des églises</title> par <persName key="Barrès, Maurice (1862-1923)" ref="http://www.idref.fr/026706601" type="author">Barrès I-PER</persName></head>
</body>
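
To save the result to a file rather than print it, one option is to wrap the root in an ElementTree and call its write method. The following is only a sketch under a few assumptions: the output path (output.xml in the temporary directory) is hypothetical, the sample XML is trimmed to one heading with fewer attributes, and the import falls back to the standard library's ElementTree if lxml is not installed.

```python
import os
import tempfile

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the standard library if lxml is absent
    import xml.etree.ElementTree as etree

xml = """
<body>
    <head>1. Un livre sur <persName type="author">Ronsard</persName></head>
</body>
"""

root = etree.fromstring(xml)

# Append the label to every <persName>, as in the answer above.
for pers in root.iter('persName'):
    pers.text = (pers.text or '') + ' I-PER'

# Hypothetical output location; replace it with the real file path.
out_path = os.path.join(tempfile.gettempdir(), 'output.xml')

# Only the text of <persName> was changed, so the other tags and
# attributes are serialized as they were parsed.
etree.ElementTree(root).write(out_path, encoding='utf-8', xml_declaration=True)

# Read the file back to confirm that the change was saved.
saved_names = [p.text for p in etree.parse(out_path).getroot().iter('persName')]
print(saved_names)
```

Writing with an explicit encoding and XML declaration keeps accented characters such as "è" readable by other tools.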

If you want to process whole headings first and only then the names, you can keep pars selected as root.xpath('//body//head') and select the inner tag (persName) from each heading tag (head):

for par in pars:
    # ...

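    # the leading "." makes the path relative to this <head>;
    # a bare '//persName' would search the whole document
    pers = par.xpath('.//persName')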

    for per in pers:
        per.text = per.text + ' I-PER'

This code gives exactly the same result, but within the processLine function you still deal with the whole <head> tag, while the pers variable contains all of that tag's <persName> descendants.
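
Put together, the per-heading approach might look like the sketch below. The processLine here is a hypothetical stand-in that only records the lines it receives; the real function is the asker's and is not shown in the post. The sample XML drops the key and ref attributes for brevity, and the import falls back to the standard library's ElementTree if lxml is not installed, which is why findall is used in place of xpath.

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the standard library if lxml is absent
    import xml.etree.ElementTree as etree

xml = """
<body>
    <head>1. Un livre sur <persName type="author">Ronsard</persName></head>
    <head>2. <title>La pitié des églises</title> par <persName type="author">Barrès</persName></head>
</body>
"""

seen_lines = []

def processLine(line):
    # Hypothetical stand-in for the asker's function: it only records each line.
    seen_lines.append(line)

root = etree.fromstring(xml)

for head in root.iter('head'):
    # itertext() yields the text of the element and of all its descendants,
    # so joining the pieces gives the full line without tags.
    processLine(''.join(head.itertext()))

    # The leading "." keeps the search inside this <head> only.
    for pers in head.findall('.//persName'):
        pers.text = (pers.text or '') + ' I-PER'

result = etree.tostring(root, encoding='unicode')
print(result)
```

Because processLine runs before the text is modified, it receives the original lines ("1. Un livre sur Ronsard" and "2. La pitié des églises par Barrès"), while the serialized result carries the added label inside each <persName>.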
