Using lxml.etree with no root/parent element

I have some SGML that looks like this:

<!DOCTYPE sometype>
<ITEM>
<DATE>19-OCT-1987</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>
<ITEM>...

I tried to parse it with lxml.html, but it appears to strip the BODY tags, which I need to preserve. Next I tried lxml.etree, but as you can see there is no common parent element for all the ITEM tags. The code I'm currently using:

doc = """<!DOCTYPE sometype>
<ITEM>
<DATE>19-OCT-1987</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>"""

from lxml import etree
parser = etree.XMLParser(recover=True) # I have invalid HTML chars to ignore
sgml = etree.fromstring(doc, parser)

Now sgml is only the first ITEM element. I need it to contain all of the ITEM elements. Any ideas? lxml.html does what I want, except that it strips the BODY tags by default, and I haven't found a way to disable this behavior.

There isn't a common parent element? Just make one! You can rewrite the document so that the items have a parent element, say ROOT: insert <ROOT> before the first <ITEM> and </ROOT> at the end of the document. It's pretty trivial to do programmatically, and you can do it in memory if you have to preserve the actual on-disk content.

For example:

<!DOCTYPE sometype>
<ROOT>
<ITEM>
<DATE>19-OCT-1987</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>
<ITEM>
<DATE>19-OCT-1879</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>
<ITEM>
<DATE>19-OCT-9871</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>
</ROOT>
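The wrapping step can be sketched in a few lines of Python. This is only an illustration, not the answerer's actual code. The helper name wrap_items and its parameters are hypothetical. It assumes the first item tag appears literally as <ITEM> (no attributes) and that everything before it, such as the DOCTYPE line, should stay in front of the new root. The on-disk content is left untouched because the wrapping happens on the in-memory string.

```python
from lxml import etree


def wrap_items(doc, first_tag="<ITEM>", root_tag="ROOT"):
    """Return doc with a synthetic root element around all items.

    Hypothetical helper: everything before the first item tag (e.g. the
    DOCTYPE line) is kept in front of the new opening root tag.
    """
    start = doc.index(first_tag)  # raises ValueError if no item tag is found
    return f"{doc[:start]}<{root_tag}>\n{doc[start:]}\n</{root_tag}>"


doc = """<!DOCTYPE sometype>
<ITEM>
<DATE>19-OCT-1987</DATE>
<TEXT><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>
<ITEM>
<DATE>19-OCT-1879</DATE>
<TEXT><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>"""

# Same recovering parser as in the question.
parser = etree.XMLParser(recover=True)
root = etree.fromstring(wrap_items(doc), parser)

# Every ITEM is now a child of the synthetic ROOT element.
items = root.findall("ITEM")
dates = [item.findtext("DATE") for item in items]
```

Locating the first <ITEM> by plain string search is the simplest thing that works for input shaped like the sample. Real-world SGML with attributes or different casing would need a more careful match.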

I've just tried this with the three-ITEM document shown above and it seems to do what you want. I saved it as /tmp/goodfoo, read the file into allcontent, and parsed it with b = lxml.etree.fromstring(allcontent). Then I accessed the text you say you want to preserve like this: b.getchildren()[0].getchildren()[-1].getchildren()[-1].text

(That is, get the first ITEM, get its TEXT element, get the TEXT element's BODY element, and return any text content of the BODY element.)
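The same lookup can also be written with element indexing or with tag-name paths, which tends to be easier to read than chained getchildren() calls (the lxml documentation marks getchildren() as deprecated in favour of list(element) or plain iteration). The sketch below is an illustration on a shortened copy of the sample document. The variable names positional, by_name and bodies are made up for the example.

```python
from lxml import etree

allcontent = """<ROOT>
<ITEM>
<DATE>19-OCT-1987</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>
<ITEM>
<DATE>19-OCT-1879</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<BODY>
SECOND BODY
</BODY></TEXT>
</ITEM>
</ROOT>"""

b = etree.fromstring(allcontent)

# Positional access, equivalent to the getchildren() chain:
# first ITEM -> its last child (TEXT) -> that element's last child (BODY).
positional = b[0][-1][-1].text

# Access by tag names: BODY text of the first ITEM.
by_name = b.findtext("ITEM/TEXT/BODY")

# BODY text of every ITEM, with surrounding whitespace trimmed.
bodies = [body.text.strip() for body in b.iterfind("ITEM/TEXT/BODY")]
```

The path-based form keeps working if the order of the child elements changes, whereas the positional form depends on TEXT and BODY being the last children.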
