
html5lib/lxml examples for BeautifulSoup users?

I'm trying to wean myself from BeautifulSoup, which I love but seems to be (aggressively) unsupported. I'm trying to work with html5lib and lxml, but I can't seem to figure out how to use the "find" and "findall" operators.

By looking at the docs for html5lib, I came up with this for a test program:

import cStringIO

f = cStringIO.StringIO()
f.write("""
  <html>
    <body>
      <table>
       <tr>
          <td>one</td>
          <td>1</td>
       </tr>
       <tr>
          <td>two</td>
          <td>2</td>
       </tr>
      </table>
    </body>
  </html>
  """)
f.seek(0)

import html5lib
from html5lib import treebuilders
from lxml import etree  # why?

parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("lxml"))
etree_document = parser.parse(f)

root = etree_document.getroot()

root.find(".//tr")

But this returns None. I noticed that if I do etree.tostring(root) I get all my data back, but all my tags are prefixed with html: (e.g. <html:table>). But root.find(".//html:tr") throws a KeyError.

Can someone put me back on the right track?

You can turn off namespacing with: etree_document = html5lib.parse(t, treebuilder="lxml", namespaceHTMLElements=False)
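A minimal sketch of that approach, assuming html5lib and lxml are installed (the HTML snippet is just illustrative):

```python
import html5lib

html = "<html><body><table><tr><td>one</td><td>1</td></tr></table></body></html>"

# With namespaceHTMLElements=False, html5lib builds the lxml tree without
# the XHTML namespace, so plain tag names work in find()/findall().
etree_document = html5lib.parse(html, treebuilder="lxml",
                                namespaceHTMLElements=False)
root = etree_document.getroot()
rows = root.findall(".//tr")  # matches with no namespace prefix needed
```

With the default (namespaceHTMLElements=True), the same findall(".//tr") would return an empty list, because every element would be named {http://www.w3.org/1999/xhtml}tr.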

In general, use lxml.html for HTML. Then you don't need to worry about generating your own parser or about namespaces.

>>> import lxml.html as l
>>> doc = """
...    <html><body>
...    <table>
...      <tr>
...        <td>one</td>
...        <td>1</td>
...      </tr>
...      <tr>
...        <td>two</td>
...        <td>2</td>
...      </tr>
...    </table>
...    </body></html>"""
>>> doc = l.document_fromstring(doc)
>>> doc.findall('.//tr')
[<Element tr at ...>, <Element tr at ...>] #doctest: +ELLIPSIS

FYI, lxml.html also allows you to use CSS selectors, which I find is an easier syntax.

>>> doc.cssselect('tr')
[<Element tr at ...>, <Element tr at ...>] #doctest: +ELLIPSIS

It appears that using the "lxml" html5lib TreeBuilder causes html5lib to build the tree in the XHTML namespace -- which makes sense, as lxml is an XML library, and XHTML is how one represents HTML as XML. You can use lxml's qname syntax with the find() method to do something like:

root.find('.//{http://www.w3.org/1999/xhtml}tr')

Or you can use lxml's full XPath functions to do something like:

root.xpath('.//html:tr', namespaces={'html': 'http://www.w3.org/1999/xhtml'})

The lxml documentation has more information on how it uses XML namespaces.
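The same two namespace lookup forms also work in the standard library's xml.etree.ElementTree, which is handy for a quick check without lxml. A sketch with a made-up XHTML document:

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><table><tr><td>one</td></tr></table></body></html>"
)

# Fully qualified {uri}tag syntax, same as lxml's qname form:
row = doc.find(".//{%s}tr" % XHTML)

# A prefix like html: only works if you supply a prefix-to-URI map:
row2 = doc.find(".//html:tr", namespaces={"html": XHTML})
```

Using the html: prefix without the namespaces mapping is exactly what fails in the question: the prefix is not defined anywhere, so the lookup raises instead of matching.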

I realize that this is an old question, but I came here in a quest for information I didn't find in any other place. I was trying to scrape something with BeautifulSoup but it was choking on some chunky HTML. The default HTML parser is apparently less lenient than some others that are available. One often preferred parser is lxml, which I believe produces the same parsing as expected for browsers. BeautifulSoup allows you to specify lxml as the source parser, but using it requires a little bit of work.

First, you need html5lib AND you must also install lxml. While html5lib is prepared to use lxml (and some other libraries), the two do not come packaged together. [For Windows users: even though I don't usually like fussing with Win dependencies, and usually get libraries by making a copy in the same directory as my project, I strongly recommend using pip for this; pretty painless; I think you need administrator access.]

Then you need to write something like this:

import urllib2
from bs4 import BeautifulSoup
import html5lib
from html5lib import sanitizer
from html5lib import treebuilders
from lxml import etree

url = 'http://...'

content = urllib2.urlopen(url)
parser = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer,
                             tree=treebuilders.getTreeBuilder("lxml"),
                             namespaceHTMLElements=False)
htmlData = parser.parse(content)
htmlStr = etree.tostring(htmlData)

soup = BeautifulSoup(htmlStr, "lxml")

Then enjoy your beautiful soup!

Note the namespaceHTMLElements=False option on the parser. This is important because lxml is intended for XML as opposed to just HTML. Because of that, it will label all the tags it provides as belonging to the HTML namespace. The tags will look like (for example)

<html:li>

and BeautifulSoup will not work well.
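A quick way to see the difference is to serialize the same document parsed both ways (a sketch, assuming html5lib and lxml are installed):

```python
import html5lib
from lxml import etree

html = "<html><body><ul><li>item</li></ul></body></html>"

# Default: every element is created in the XHTML namespace.
namespaced = html5lib.parse(html, treebuilder="lxml")
# With namespaceHTMLElements=False: bare tag names.
plain = html5lib.parse(html, treebuilder="lxml", namespaceHTMLElements=False)

ns_out = etree.tostring(namespaced.getroot())
plain_out = etree.tostring(plain.getroot())
# ns_out carries the XHTML namespace on its elements, which is what
# trips up BeautifulSoup; plain_out has bare tags like <li>.
```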

Try:

root.find('.//{http://www.w3.org/1999/xhtml}tr')

You have to specify the namespace rather than the namespace prefix (html:tr). For more information, see the lxml docs, particularly the section on namespaces.
