I'm trying to pull a number off of a webpage, specifically the current presidential approval rating from RealClearPolitics.
Here's the code I'm using: urllib2 fetches the page, lxml parses it, and the XPath is the one Chrome reports. The problem is that all I get at the end is an empty list.
import urllib2
from lxml import etree
url = "http://www.realclearpolitics.com/epolls/other/president_obama_job_approval-1044.html"
page = urllib2.urlopen(url)
tree = etree.parse(page, etree.HTMLParser())
rcp=tree.xpath('//*[@id="polling-data-rcp"]/table/tbody/tr[2]/td[4]')
print rcp
Any help would be appreciated!
The tr[2]/td[4] part of the XPath is not right.
So you need a correct XPath query, one for each figure. The Python code would be:
import requests
from lxml import html
URL = "http://www.realclearpolitics.com/epolls/other/president_obama_job_approval-1044.html"
response = requests.get(URL)
tree = html.fromstring(response.content)
rcp_approve = '//table[@class="chart_legend small_legend"]/tbody/tr/td[@class="candidate"][1]/div[1]/span/text()'
rcp_disapprove = '//table[@class="chart_legend small_legend"]/tbody/tr/td[@class="candidate"][2]/div[1]/span/text()'
rcp_approve = float(tree.xpath(rcp_approve)[0])
rcp_disapprove = float(tree.xpath(rcp_disapprove)[0])
print "Obama's approve rate: {}".format(rcp_approve)
print "Obama's disapprove rate: {}".format(rcp_disapprove)
Output:
Obama's approve rate: 44.4
Obama's disapprove rate: 51.6
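To check a query without hitting the live page, something like the following minimal sketch may help. It assumes a simplified stand-in for the legend table (the SAMPLE string and the first_number helper are hypothetical, not taken from the real page), and it swaps in the standard library's xml.etree.ElementTree for lxml so it runs without extra packages. ElementTree only supports a subset of XPath, so the value is read from .text rather than with text(). The helper returns None instead of raising when a query matches nothing, which is the empty-list case from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the legend table on the page.
SAMPLE = """
<div>
  <table class="chart_legend small_legend">
    <tbody>
      <tr>
        <td class="candidate"><div><span>44.4</span></div></td>
        <td class="candidate"><div><span>51.6</span></div></td>
      </tr>
    </tbody>
  </table>
</div>
"""


def first_number(root, path):
    """Return the first match's text as a float, or None if nothing matched."""
    matches = root.findall(path)
    if not matches:
        return None
    return float(matches[0].text)


root = ET.fromstring(SAMPLE)
legend = ".//table[@class='chart_legend small_legend']/tbody/tr/"

approve = first_number(root, legend + "td[1]/div/span")
disapprove = first_number(root, legend + "td[2]/div/span")
# A positional path that does not exist in the markup matches nothing.
missing = first_number(root, ".//table/tbody/tr[2]/td[4]")

print("approve: {}".format(approve))
print("disapprove: {}".format(disapprove))
print("missing: {}".format(missing))
```

Checking for an empty result before indexing with [0] gives a clearer failure than an IndexError if the site's markup changes.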