
Equivalent to InnerHTML when using lxml.html to parse HTML

I'm working on a script that uses lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time, but am now experimenting with lxml because of its speed.

I would like to know the most sensible way in the library to do the equivalent of JavaScript's innerHTML, that is, to retrieve or set the complete contents of a tag.

<body>
<h1>A title</h1>
<p>Some text</p>
</body>

The innerHTML of body is therefore:

<h1>A title</h1>
<p>Some text</p>

I can do it using hacks (converting to a string, regexes, etc.), but I'm assuming there is a correct way to do this with the library that I am missing due to unfamiliarity. Thanks for any help.

EDIT: Thanks to pobk for showing me the way on this so quickly and effectively. For anyone trying the same, here is what I ended up with:

from lxml import html
from cStringIO import StringIO
t = html.parse(StringIO(
"""<body>
<h1>A title</h1>
<p>Some text</p>
Untagged text
<p>
Unclosed p tag
</body>"""))
root = t.getroot()
body = root.body
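# body.text holds any text before the first child; iterchildren() avoids
# serializing nested elements twice, which iterdescendants() would do
print (body.text or '') + ''.join([html.tostring(child) for child in body.iterchildren()])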

Note that the lxml.html parser will fix up the unclosed tag, so beware if this is a problem.
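For anyone who wants to see the fix-up in action, here is a minimal Python 3 sketch. It assumes a made-up one-paragraph document and uses html.document_fromstring rather than html.parse, purely to keep the example short; the variable names are illustrative.

```python
from lxml import html

# A document whose <p> is never closed (illustrative input).
doc = html.document_fromstring("<body><p>Unclosed p tag</body>")

# Serialize the body back to a string; the parser has repaired the tree,
# so the output contains a closing </p> that the input did not have.
serialized = html.tostring(doc.body, encoding="unicode")
print(serialized)
```

If you need the markup back byte for byte, this round trip is not suitable, since the serialized form reflects the repaired tree rather than the original text.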

Sorry for bringing this up again, but I've been looking for a solution and yours contains a bug:

<body>This text is ignored
<h1>Title</h1><p>Some text</p></body>

Text directly under the root element is ignored. I ended up doing this:

(body.text or '') +\
''.join([html.tostring(child) for child in body.iterchildren()])
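To illustrate why iterchildren() is the safer choice here, the following Python 3 sketch compares it with iterdescendants() on a nested snippet. The markup and variable names are made up for the example and are not from the original post.

```python
from lxml import html

# Illustrative input: leading text plus a nested element.
doc = html.document_fromstring(
    "<body>Lead text<div><p>Nested</p></div></body>"
)
body = doc.body

# Direct children only: each subtree is serialized exactly once.
with_children = (body.text or "") + "".join(
    html.tostring(child, encoding="unicode") for child in body.iterchildren()
)

# All descendants: the <p> is emitted inside the <div> and then again alone.
with_descendants = (body.text or "") + "".join(
    html.tostring(child, encoding="unicode") for child in body.iterdescendants()
)

print(with_children)
print(with_descendants)
```

With flat markup such as the example in the question the two give the same result, which is probably why the difference is easy to miss.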

You can get the children of an ElementTree node using the getchildren() or iterdescendants() methods of the root node (note that iterdescendants() also yields nested descendants, not only direct children):

>>> from lxml import etree
>>> from cStringIO import StringIO
>>> t = etree.parse(StringIO("""<body>
... <h1>A title</h1>
... <p>Some text</p>
... </body>"""))
>>> root = t.getroot()
>>> for child in root.iterdescendants():
...  print etree.tostring(child)
...
<h1>A title</h1>

<p>Some text</p>

This can be shortened as follows:

print ''.join([etree.tostring(child) for child in root.iterdescendants()])

import lxml.etree as ET
from lxml import html

# t is assumed to be an already parsed tree, e.g. from html.parse(...)
body = t.xpath("//body")
for tag in body:
    # re-parse the first and second child of <body> and pick out h1 / p
    h = html.fromstring(ET.tostring(tag[0])).xpath("//h1")
    p = html.fromstring(ET.tostring(tag[1])).xpath("//p")
    htext = h[0].text_content()
    ptext = p[0].text_content()

You can also use .get('href') on an a tag to read its href, and .attrib to access an element's attributes.

Here the tag index is hardcoded, but you can also determine it dynamically.
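One possible way to avoid the hardcoded indices, sketched in Python 3: iterate over the children of body and key the results by tag name. The sample markup, the /x link target, and the variable names are assumptions made for illustration only.

```python
from lxml import html

# Illustrative document with a link inside the paragraph.
doc = html.document_fromstring(
    '<body><h1>A title</h1><p>Some <a href="/x">link</a></p></body>'
)

# Iterating an element yields its direct children, so no index is needed.
texts = {child.tag: child.text_content() for child in doc.body}

# .get() reads one attribute; .attrib exposes all of them as a mapping.
link = doc.body.xpath(".//a")[0]
href = link.get("href")
attrs = dict(link.attrib)

print(texts, href, attrs)
```

Keying by tag name only works when each tag occurs once among the children; with repeated tags a list of (tag, text) pairs would be the safer structure.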

Here is a Python 3 version:

from xml.sax import saxutils
from lxml import html

def inner_html(tree):
    """ Return inner HTML of lxml element """
    return (saxutils.escape(tree.text) if tree.text else '') + \
        ''.join([html.tostring(child, encoding=str) for child in tree.iterchildren()])

Note that this escapes the initial text, as recommended by andreymal. This is needed to avoid tag injection if you're working with sanitized HTML!
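A short sketch of what goes wrong without the escaping, assuming a made-up paragraph whose leading text contains an escaped angle bracket. After parsing, the element's .text holds the decoded character, so concatenating it unescaped changes the markup.

```python
from xml.sax import saxutils
from lxml import html

# Illustrative input: the leading text contains an escaped "<".
el = html.fragment_fromstring("<p>1 &lt; 2 <b>bold</b></p>")

children = "".join(html.tostring(c, encoding="unicode") for c in el)

# Naive version: .text is already decoded, so a raw "<" leaks into the output.
naive = (el.text or "") + children

# Escaped version: the leading text is re-encoded before concatenation.
escaped = saxutils.escape(el.text or "") + children

print(naive)
print(escaped)
```

The child elements do not need the same treatment because html.tostring already escapes text inside the subtrees it serializes.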

I find none of the answers satisfying, and some are even in Python 2, so here is a one-liner solution that produces innerHTML-like output and works with Python 3:

from lxml import etree, html

# generate some HTML element node
node = html.fromstring("""<container>
Some random text <b>bold <i>italic</i> yeah</b> no yeah
<!-- comment blah blah -->  <img src='gaga.png' />
</container>""")

# compute inner HTML of element
innerHTML = "".join([
    str(c) if type(c)==etree._ElementUnicodeResult 
    else html.tostring(c, with_tail=False).decode() 
    for c in node.xpath("node()")
]).strip()

The result will be:

'Some random text <b>bold <i>italic</i> yeah</b> no yeah\n<!-- comment blah blah -->  <img src="gaga.png">'

What it does: the XPath expression returns all child nodes (text, elements, comments). The list comprehension produces a list of the text contents of the text nodes and the HTML content of the element nodes. Those are then joined into a single string. If you want to get rid of comments, use *|text() instead of node() in the XPath.
