
Using lxml, what causes a "lxml.etree.XMLSyntaxError: Document is empty" error?

I'm using mechanize/cookiejar/lxml to read a page, and it works for some pages but not others. The error I get on the failing pages is the one in the title. I can't post the pages here because they aren't SFW, but is there a way to fix it? Basically, this is what I do:

import mechanize, cookielib
from lxml import etree    

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(False)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 maverick Firefox/3.6.13')]

response = br.open('...')
tree = etree.parse(response)  # raises XMLSyntaxError: Document is empty (on the failing pages)

After that I get the root and search the document for the values I want. Apparently iterparse doesn't crash on these pages, but for now I assume that's only because I haven't processed anything with it yet. Plus, I haven't figured out how to search for things with it.

I've tried disabling gzip and enabling the referer as well, but neither solves the problem. I also tried saving the source to disk and building the tree from there, just to check, and I get the same error.

Edit
The response I get seems to be fine: using print repr(response) as suggested, I get <response_seek_wrapper at 0xa4a160c whose wrapped object = <stupid_gzip_wrapper at 0xa49acec whose fp = <socket._fileobject object at 0xa49c32c>>>. I can also save the response using the read() method and check that the saved .xml works in the browser and everything.

Also, one of the pages contains a &rsquo; that gives me the following error: "lxml.etree.XMLSyntaxError: Entity 'rsquo' not defined, line 17, column 7054". So far I've replaced it with a regex, but is there a parser that can handle this? I get this error even with the lxml.html.parse suggested below.
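For context on the entity error: XML predefines only five named entities (lt, gt, amp, apos, quot), while rsquo is an HTML entity, so a strict XML parser rejects it unless a DTD declares it. The sketch below uses only the Python 3 standard library (not lxml, and not the Python 2 setup from the question) to show the same class of failure and one blunt workaround. The snippet and helper name are made up for illustration.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical fragment containing an HTML-only named entity.
snippet = "<p>It&rsquo;s here</p>"


def parses_as_xml(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The strict XML parser does not know &rsquo; and rejects the fragment.
strict_ok = parses_as_xml(snippet)

# html.unescape resolves HTML named entities to their characters.
# Caveat: it also decodes &amp; and &lt;, which can break real markup,
# so this is only safe on text known not to contain those.
decoded = html.unescape(snippet)
decoded_ok = parses_as_xml(decoded)
```

Because of the caveat in the comments, this is closer to the regex replacement already described than to a general fix; an HTML-aware parser is usually the more robust route.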

Regarding the file being highlighted, I meant that when I open it with gEdit it looks something like this: http://img34.imageshack.us/img34/9574/gedit.jpg

Use lxml.html.parse for HTML; it can handle even badly broken HTML. Do you still get an error?
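As a rough illustration of that suggestion, the sketch below feeds deliberately malformed markup to lxml.html.parse through a file-like object. The markup and variable names are invented for the example, and it is written for Python 3 rather than the Python 2 setup in the question, so treat it as a starting point rather than a confirmed fix for the failing pages.

```python
from io import BytesIO

import lxml.html

# Hypothetical broken page: no closing tags, plus an HTML-only entity.
broken = b"<html><body><p>It&rsquo;s <b>unclosed<p>second"

# lxml.html.parse accepts a filename, a URL, or a file-like object
# and returns an ElementTree built with the lenient HTML parser.
tree = lxml.html.parse(BytesIO(broken))
root = tree.getroot()

# Search the recovered tree as usual.
paragraphs = root.findall(".//p")
text = root.text_content()
```

If a page still fails with this parser, checking what bytes actually reach it (for example the length of response.read()) would be a reasonable next step.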

What is the nature of response? According to the help, etree.parse expects one of:

   - a file name/path
   - a file object
   - a file-like object
   - a URL using the HTTP or FTP protocol
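One way a file-like object can hand the parser an empty document is if it has already been read to the end. Whether that applies here is not established (the question reports the same error when parsing a saved file), so the sketch below only demonstrates the mechanism. It uses io.BytesIO as a stand-in for the response object and is written for Python 3; the payload and names are assumptions.

```python
from io import BytesIO

from lxml import etree

payload = b"<root><item>1</item></root>"

# Stand-in for a response-like object (assumption, not mechanize itself).
stream = BytesIO(payload)

# Reading consumes the stream; the position is now at the end.
first_read = stream.read()

# Parsing the exhausted stream gives the parser zero bytes.
try:
    etree.parse(stream)
    exhausted_error = None
except etree.XMLSyntaxError as exc:
    exhausted_error = exc

# Rewinding first lets the same object be parsed normally.
stream.seek(0)
tree = etree.parse(stream)

# Alternatively, parse the bytes that were already read.
root_from_bytes = etree.fromstring(first_read)
```

Parsing the bytes directly with etree.fromstring sidesteps any question about the stream position, at the cost of holding the whole document in memory.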
