Using lxml and xpath to get text from a webpage

I'm trying to pull a number off of a webpage, specifically the current presidential approval rating from RealClearPolitics.

Here's the code I'm using: urllib2 to fetch the webpage, lxml to parse it, and the XPath that Chrome reports. The problem is that all I get at the end is an empty list.

import urllib2
from lxml import etree

url = "http://www.realclearpolitics.com/epolls/other/president_obama_job_approval-1044.html"
page = urllib2.urlopen(url)

tree = etree.parse(page.content, etree.HTMLParser())

rcp=tree.xpath('//*[@id="polling-data-rcp"]/table/tbody/tr[2]/td[4]')

print rcp

Any help would be appreciated!

tr[2]/td[4] is not right. See:

[image: enter image description here]

So you would need to use a correct XPath query:

[image: enter image description here]

And the Python code would be:

import requests
from lxml import html

URL = "http://www.realclearpolitics.com/epolls/other/president_obama_job_approval-1044.html"
response = requests.get(URL)
tree = html.fromstring(response.content)

rcp_approve = '//table[@class="chart_legend small_legend"]/tbody/tr/td[@class="candidate"][1]/div[1]/span/text()'
rcp_disapprove = '//table[@class="chart_legend small_legend"]/tbody/tr/td[@class="candidate"][2]/div[1]/span/text()'

rcp_approve = float(tree.xpath(rcp_approve)[0])
rcp_disapprove = float(tree.xpath(rcp_disapprove)[0])

print("Obama's approve rate: {}".format(rcp_approve))
print("Obama's disapprove rate: {}".format(rcp_disapprove))

Output:

Obama's approve rate: 44.4
Obama's disapprove rate: 51.6
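
Indexing the XPath result with [0] raises an IndexError whenever the query matches nothing, which is the same empty-list situation described in the question. Below is a minimal sketch of a guarded lookup. The helper name first_float and the inline SAMPLE markup are assumptions made for illustration: the markup only mimics the class names used in the XPath above, and the real page's structure may differ.

```python
from lxml import html

# Hypothetical markup that only mimics the class names in the XPath above;
# it is not a copy of the real page.
SAMPLE = """
<html><body>
<table class="chart_legend small_legend">
  <tbody>
    <tr>
      <td class="candidate"><div><span>44.4</span></div></td>
      <td class="candidate"><div><span>51.6</span></div></td>
    </tr>
  </tbody>
</table>
</body></html>
"""


def first_float(tree, xpath):
    """Return the first text match of `xpath` as a float, or None if nothing matched."""
    matches = tree.xpath(xpath)
    if not matches:
        # An empty list means the expression did not match the parsed tree.
        return None
    return float(matches[0].strip())


tree = html.fromstring(SAMPLE)
base = '//table[@class="chart_legend small_legend"]/tbody/tr/td[@class="candidate"]'

approve = first_float(tree, base + '[1]/div[1]/span/text()')
disapprove = first_float(tree, base + '[2]/div[1]/span/text()')

# The query from the question finds nothing in this sample, so None comes back
# instead of an exception.
missing = first_float(tree, '//*[@id="polling-data-rcp"]/table/tbody/tr[2]/td[4]/text()')

print(approve, disapprove, missing)
```

Returning None makes the "no match" case explicit, so the caller can tell a changed page layout apart from a parsing error.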
