Getting text from a webpage using lxml and XPath

Question

I'm trying to pull a number off a webpage, specifically the current presidential approval rating from RealClearPolitics.

Here's the code I'm using: urllib2 fetches the webpage, lxml parses it, and the XPath is the one Chrome reports. The problem is that all I get at the end is an empty list.

import urllib2
from lxml import etree

url = "http://www.realclearpolitics.com/epolls/other/president_obama_job_approval-1044.html"
page = urllib2.urlopen(url)

tree = etree.parse(page, etree.HTMLParser())

rcp = tree.xpath('//*[@id="polling-data-rcp"]/table/tbody/tr[2]/td[4]')

print rcp

Any help would be appreciated!

Answer 1

The path tr[2]/td[4] is not right, so you need to use a correct XPath query. The Python code would be:

import requests
from lxml import html

URL = "http://www.realclearpolitics.com/epolls/other/president_obama_job_approval-1044.html"

# Download the page and parse the raw bytes into an element tree.
response = requests.get(URL)
tree = html.fromstring(response.content)

# XPath queries for the approve/disapprove figures in the chart legend table.
approve_xpath = '//table[@class="chart_legend small_legend"]/tbody/tr/td[@class="candidate"][1]/div[1]/span/text()'
disapprove_xpath = '//table[@class="chart_legend small_legend"]/tbody/tr/td[@class="candidate"][2]/div[1]/span/text()'

# xpath() returns a list of matching text nodes; take the first and convert it.
rcp_approve = float(tree.xpath(approve_xpath)[0])
rcp_disapprove = float(tree.xpath(disapprove_xpath)[0])

print "Obama's approve rate: {}".format(rcp_approve)
print "Obama's disapprove rate: {}".format(rcp_disapprove)

Output:

Obama's approve rate: 44.4
Obama's disapprove rate: 51.6
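
A browser-copied XPath often comes back empty for a general reason, and it may or may not be the whole story for this page. Browsers insert a tbody element into every table in their DOM, but lxml's HTML parser only sees a tbody if the page source actually contains one. The offline sketch below illustrates this with made-up markup. SAMPLE, browser_path, source_path and the first_float helper are hypothetical names for this illustration, not part of the RealClearPolitics page or the lxml API.

```python
from lxml import html

# Hypothetical markup: a table written WITHOUT <tbody>, as many pages are.
SAMPLE = """
<div id="polling-data-rcp">
  <table>
    <tr><th>Poll</th><th>Date</th><th>Sample</th><th>Approve</th></tr>
    <tr><td>RCP Average</td><td>1/1 - 1/8</td><td>--</td><td>44.4</td></tr>
  </table>
</div>
"""

tree = html.fromstring(SAMPLE)

# Path as a browser would report it (the browser DOM contains a tbody).
browser_path = '//*[@id="polling-data-rcp"]/table/tbody/tr[2]/td[4]'
# Path that does not depend on tbody being present in the source.
source_path = '//*[@id="polling-data-rcp"]/table//tr[2]/td[4]/text()'


def first_float(tree, path):
    """Return the first text match of `path` as a float, or None if nothing matched."""
    matches = tree.xpath(path)
    return float(matches[0]) if matches else None


print(tree.xpath(browser_path))        # no tbody in the source, so nothing matches
print(first_float(tree, source_path))  # the cell is found via the // step
```

Checking for an empty result before indexing, as first_float does, avoids an IndexError from [0] if the page layout changes.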
