Parsing HTML and JS in Python using lxml
I'm having trouble parsing JS using lxml in Python. When I execute the code below, my output is:
<Element div at 0x10cec4e10>
import urllib2
import lxml.html
from lxml.html.clean import Cleaner
cleaner = Cleaner()
cleaner.javascript = True
text = urllib2.urlopen("URL").read().decode("utf-8")
test = lxml.html.fromstring(cleaner.clean_html(text))
print test
What I'm trying to get is the parsed text without the JS stuff. Can someone shed some light? Thanks.
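The output in the question is the default representation of an lxml element object, not its contents: `print test` shows the `repr` of the root element returned by `lxml.html.fromstring`. Below is a minimal sketch of getting the visible text instead, assuming Python 3 and an inline sample document in place of a network request. The `SAMPLE_HTML` string and the `visible_text` helper are made up for illustration, and the sketch drops `script` and `style` elements with `lxml.etree.strip_elements` rather than with `Cleaner`, then calls `text_content()`:

```python
import lxml.etree
import lxml.html

# Hypothetical sample page, used here instead of a network request.
SAMPLE_HTML = """
<html>
  <head>
    <title>Demo</title>
    <script>var tracking = "should not appear";</script>
    <style>p { color: red; }</style>
  </head>
  <body>
    <div><p>Hello, <b>world</b>!</p><script>alert("hi");</script></div>
  </body>
</html>
"""


def visible_text(html):
    """Return the text of an HTML document without script or style content."""
    tree = lxml.html.fromstring(html)
    # Remove <script> and <style> elements, keeping any text that follows them.
    lxml.etree.strip_elements(tree, "script", "style", with_tail=False)
    # text_content() concatenates all remaining text nodes.
    text = tree.text_content()
    # Collapse the runs of whitespace left over from the markup.
    return " ".join(text.split())


if __name__ == "__main__":
    print(visible_text(SAMPLE_HTML))
```

The same two calls should work on a tree that has already been through `cleaner.clean_html`, since the result of `lxml.html.fromstring` is an ordinary element either way.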
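import lxml.etree
import lxml.html
import lxml.html.clean  # submodules must be imported explicitly; "import lxml" alone is not enough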
import urllib2
URL = "http://www.google.com/"
ENCODING = "latin1"
args = {
"javascript": True, # strip javascript
"page_structure": False, # leave page structure alone
"style": True # remove CSS styling
}
cleaner = lxml.html.clean.Cleaner(**args)
# get the page source
html = urllib2.urlopen(URL).read().decode(ENCODING)
# clean it up
clean = cleaner.clean_html(html)
# print unformatted html dump
print(clean)
# print properly indented html
tree = lxml.html.fromstring(clean)
print(lxml.etree.tostring(tree, pretty_print=True))
Note that pretty-printing works properly with lxml.etree.tostring(), but poorly with lxml.html.tostring(), which does linebreaks but not indenting - go figure.
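To compare the two serializers directly, a small sketch along these lines may help. It assumes Python 3, where both `tostring()` functions return bytes by default, so the results are decoded before printing. The `FRAGMENT` string and the `serialize_both` helper are hypothetical names chosen for this example, and the exact layout of the output can vary between lxml and libxml2 versions:

```python
import lxml.etree
import lxml.html

# Hypothetical fragment written on one line, so any line breaks or
# indentation in the output come from the serializer.
FRAGMENT = "<div><ul><li>one</li><li>two</li></ul></div>"


def serialize_both(html):
    """Serialize the same tree with the XML and the HTML serializer."""
    tree = lxml.html.fromstring(html)
    # XML serializer with pretty-printing enabled.
    as_xml = lxml.etree.tostring(tree, pretty_print=True).decode("utf-8")
    # HTML serializer with the same flag.
    as_html = lxml.html.tostring(tree, pretty_print=True).decode("utf-8")
    return as_xml, as_html


if __name__ == "__main__":
    xml_text, html_text = serialize_both(FRAGMENT)
    print(xml_text)
    print(html_text)
```

Printing both results side by side makes it easy to see which serializer indents nested elements in the installed version.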