Using lxml.etree without a root/parent element

Question

I have some SGML that looks like this:

<!DOCTYPE sometype>
<ITEM>
<DATE>19-OCT-1987</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>
<ITEM>...

I tried to parse it with lxml.html, but it appears to strip the BODY tags, which I need to preserve. Next I tried lxml.etree, but as you can see there is no common parent element for all the ITEM tags. This is the code I'm currently using:

doc = """<!DOCTYPE sometype>
<ITEM>
<DATE>19-OCT-1987</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>"""

from lxml import etree
parser = etree.XMLParser(recover=True) # I have invalid HTML chars to ignore
sgml = etree.fromstring(doc, parser)

Now sgml is only the first ITEM element. I need it to contain all of the ITEM elements. Any ideas? lxml.html does what I want, but it strips the BODY tags by default, and I haven't found a way to disable this behavior.

Answer 1

There isn't a common parent element? Just make one! You can rewrite the document so that the items have a parent element, say ROOT: insert <ROOT> before the first <ITEM> and </ROOT> at the end of the document. It's pretty trivial to do programmatically, even if you have to preserve the actual on-disk content.

For example:

<!DOCTYPE sometype>
<ROOT>
<ITEM>
<DATE>19-OCT-1987</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>
<ITEM>
<DATE>19-OCT-1879</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>
<ITEM>
<DATE>19-OCT-9871</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>
</ROOT>
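
The wrapping step can be sketched in a few lines of Python. This is only an illustration under some assumptions: the helper name wrap_in_root is made up here, and it assumes the DOCTYPE (if present) sits at the very start of the text and has no internal subset. The on-disk file is left untouched, since only the in-memory string is changed.

```python
import re


def wrap_in_root(sgml_text, root_tag="ROOT"):
    """Return sgml_text with everything after a leading DOCTYPE wrapped in one element.

    Hypothetical helper: assumes any DOCTYPE is at the start of the text
    and contains no '>' inside it (i.e. no internal subset).
    """
    # Find an optional leading DOCTYPE declaration so it stays outside the root.
    match = re.match(r"\s*<!DOCTYPE[^>]*>", sgml_text)
    split_at = match.end() if match else 0
    head, body = sgml_text[:split_at], sgml_text[split_at:]

    wrapped = f"<{root_tag}>{body}\n</{root_tag}>"
    return f"{head}\n{wrapped}" if head else wrapped


# Small sample input, shortened from the document in the question.
sample = "<!DOCTYPE sometype>\n<ITEM>first</ITEM>\n<ITEM>second</ITEM>"
wrapped_sample = wrap_in_root(sample)
print(wrapped_sample)
```

The returned string can then be handed to the parser in place of the original text.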

I've just tried this with the wrapped document above, and it seems to do what you want. I saved it as /tmp/goodfoo and loaded its contents with lxml.etree.fromstring(allcontent); then, with b holding the parsed root element, I accessed the text you say you want to preserve like this: b.getchildren()[0].getchildren()[-1].getchildren()[-1].text

(That is, get the first ITEM, get its TEXT element, get the TEXT element's BODY element, and return the text content of the BODY element.)
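
The same lookup can also be written by tag name instead of by child position, which may be easier to read and is less sensitive to element order; getchildren() is also marked as deprecated in lxml. The following is a sketch rather than the exact code used above: the variable names are illustrative, and the sample document is a shortened version of the wrapped example with two items.

```python
from lxml import etree

# Shortened version of the wrapped document (two ITEM elements).
wrapped = """<!DOCTYPE sometype>
<ROOT>
<ITEM>
<DATE>19-OCT-1987</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>
<ITEM>
<DATE>19-OCT-1879</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>
</ROOT>"""

# recover=True mirrors the parser settings used in the question.
parser = etree.XMLParser(recover=True)
root = etree.fromstring(wrapped, parser)

# Direct ITEM children of ROOT, then a path lookup inside each one.
items = root.findall("ITEM")
dates = [item.findtext("DATE") for item in items]
bodies = [item.findtext("TEXT/BODY").strip() for item in items]

for date, body in zip(dates, bodies):
    print(date, "->", body)
```

Here root.findall("ITEM") returns every ITEM under the new parent, so all items are available from a single parse.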
