
Using lxml.etree with no root/parent element

I have some SGML that looks like this:

<!DOCTYPE sometype>
<ITEM>
<DATE>19-OCT-1987</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>
<ITEM>...

I tried to parse it with lxml.html, but it appears to strip the BODY tags, which I need to preserve. Next I tried lxml.etree, but as you can see there is no common parent element for all the ITEM tags. This is the code I'm currently using:

doc = """<!DOCTYPE sometype>
<ITEM>
<DATE>19-OCT-1987</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>"""

from lxml import etree
parser = etree.XMLParser(recover=True) # I have invalid HTML chars to ignore
sgml = etree.fromstring(doc, parser)

Now sgml is only the first ITEM element. I need it to be all of the ITEM elements. Any ideas? lxml.html does what I want, but it strips the BODY tags by default, and I haven't found a way to disable this behavior.

There isn't a common parent element? Just make one! Wrap the ITEM elements in a parent element, say ROOT: insert <ROOT> before the first <ITEM> and </ROOT> at the end of the document. This is trivial to do programmatically, and it can be done in memory if the actual on-disk content has to stay unchanged.

For example:

<!DOCTYPE sometype>
<ROOT>
<ITEM>
<DATE>19-OCT-1987</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>
<ITEM>
<DATE>19-OCT-1879</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>
<ITEM>
<DATE>19-OCT-9871</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>
</ROOT>
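
The wrapping step can be done in memory, leaving the file on disk untouched. The following is a minimal sketch, not a definitive implementation: the helper name wrap_in_root and the shortened sample text are made up for illustration, and it assumes the first record literally starts with "<ITEM" and that nothing else follows the last record.

```python
from lxml import etree


def wrap_in_root(text, first_tag="<ITEM", root="ROOT"):
    """Return text with a synthetic root element around all records.

    Hypothetical helper: inserts <ROOT> before the first <ITEM> and
    appends </ROOT> at the end, so a leading DOCTYPE stays in front.
    """
    start = text.find(first_tag)
    if start == -1:
        raise ValueError("no %s element found" % first_tag)
    return "%s<%s>\n%s\n</%s>" % (text[:start], root, text[start:], root)


# Shortened sample in the same shape as the question's data
sample = """<!DOCTYPE sometype>
<ITEM><DATE>19-OCT-1987</DATE></ITEM>
<ITEM><DATE>20-OCT-1987</DATE></ITEM>"""

# recover=True is kept from the question so bad characters are tolerated
parser = etree.XMLParser(recover=True)
root = etree.fromstring(wrap_in_root(sample), parser)

# Every ITEM is now a child of the synthetic ROOT element
dates = [item.findtext("DATE") for item in root]
```

Matching on the literal string "<ITEM" is deliberately naive. If the real files contain other top-level tags or trailing content, the insertion points would need to be chosen more carefully.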

I've just tried this and it seems to do what you want. I saved it as /tmp/goodfoo, loaded the file contents with b = lxml.etree.fromstring(allcontent), and then accessed the text you want to preserve like this: b.getchildren()[0].getchildren()[-1].getchildren()[-1].text

(that is, get the first ITEM, get its TEXT element, get the TEXT element's BODY element, and return any text content of the BODY element.)
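
The positional lookup above depends on the order of the child elements. As a hedged alternative, the same text can be reached by tag name using the ElementPath support in findall and findtext. The sketch below assumes the wrapped document shown earlier, cut down to a single ITEM for brevity. The variable names are illustrative only.

```python
from lxml import etree

# One record from the wrapped example, with the synthetic ROOT in place
wrapped = """<ROOT>
<ITEM>
<DATE>19-OCT-1987</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>
</ROOT>"""

root = etree.fromstring(wrapped, etree.XMLParser(recover=True))

# Look up BODY by path under each ITEM; default="" covers items without one
bodies = [
    item.findtext("TEXT/BODY", default="").strip()
    for item in root.findall("ITEM")
]

# Index-based access reaches the same element as the getchildren() chain
body_by_position = root[0][-1][-1]
```

Elements support indexing and iteration directly, so root[0][-1][-1] is a shorter spelling of the getchildren() chain, which lxml documents as deprecated.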
