How to unescape special characters while converting pyquery object to string

I am trying to fetch a remote page with the Python requests module, reconstruct a DOM tree, do some processing, and save the result to a file. When I fetch a page and write it straight to a file, everything works (I can later open the HTML file in a browser and it renders correctly).

However, if I create a pyquery object, do some processing, and then save it using str() conversion, it fails. Specifically, special characters such as && inside script tags get escaped in the saved source (a side effect of going through pyquery), which prevents the page from rendering correctly.

Here is my code:

import requests
from pyquery import PyQuery as pq

user_agent = {'User-agent': 'Mozilla/5.0'}
r = requests.get('http://www.google.com', headers=user_agent, timeout=4)

DOM = pq(r.text)
# some optional processing
# write the serialized DOM (text mode, since str() returns text on Python 3)
with open("fTest.html", "w", encoding="utf-8") as fTest:
    fTest.write(str(DOM))

So, the question is: how do I make sure that special characters aren't escaped after going through pyquery? I suspect it is related to lxml (the library pyquery is built on), but after a tedious online search and experiments with different ways of serializing the object, I still haven't solved it. Maybe it is also related to Unicode handling?

Many thanks in advance!

I have found an elegant solution to the problem, and the reason why the code didn't work before.

First, read the page at http://lxml.de/lxmlhtml.html carefully. It has a section titled "Creating HTML with the E-factory". Right after that section, the docs point out that etree.tostring() returns the XML representation of the document. For HTML, which may contain script or style tags, that representation mangles things: the contents of those tags get escaped like ordinary text. Second, the solution is to use lxml.html.tostring() instead, which serializes the tree as HTML.
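To see the difference without touching the network, here is a minimal sketch that serializes the same parsed tree both ways. The SAMPLE markup and the variable names are made up for illustration, and it assumes only that lxml is installed.

```python
from lxml import etree, html

# Hypothetical sample page: a script whose body contains "&&".
SAMPLE = (
    "<html><head><script>if (a && b) { run(); }</script></head>"
    "<body><p>hi</p></body></html>"
)

doc = html.fromstring(SAMPLE)

# XML serialization: "&" in the script body is escaped as "&amp;".
as_xml = etree.tostring(doc, encoding="unicode")

# HTML serialization: script contents are written out unescaped.
as_html = html.tostring(doc, encoding="unicode")

print(as_xml)
print(as_html)
```

The first printed line should contain `&amp;&amp;` inside the script tag, while the second keeps `&&` intact, which is what a browser expects.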

The final working code is:

# for networking
import requests
# for parsing and serialization
from lxml.html import tostring as html2str  # IMPORTANT: HTML-aware serializer
from pyquery import PyQuery as pq

user_agent = {'User-agent': 'Mozilla/5.0'}
r = requests.get('http://www.google.com', headers=user_agent, timeout=4)

# construct DOM object
DOM = pq(r.text)
# do stuff with DOM
#
# save result to file; html2str() returns bytes, hence binary mode
with open("fTest.html", "wb") as fTest:
    fTest.write(html2str(DOM.root))  # IMPORTANT: not str(DOM)

Hope this saves some of you time in the future! Have fun, guys! ;)
