Getting a specific value from a web page using LXML

Question

I am trying to grab a value from a page using LXML and Python.

I followed some basic examples, which worked. But I'm struggling to get the text from quite a complex (to me at least) web page.

I want to grab the number of followers from this page: http://twitter.com/aberdeencc

I want the exact number of followers, which at the time of writing is 10,623, not the displayed 10.6K. The exact value is only shown as a tooltip-style mouseover.

Looking at the page code, it is in this section:

<a class="ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-openSignupDialog js-nonNavigable u-textUserColor" data-nav="followers" 
   href="/AberdeenCC/followers" data-original-title="10,623 Followers">
       <span class="ProfileNav-label">Followers</span>
       <span class="ProfileNav-value" data-is-compact="true">10.6K</span>
</a>

The code I have is:

from lxml import html
import requests

page = requests.get('http://twitter.com/aberdeencc')
tree = html.fromstring(page.text)

followers = tree.xpath('//span[@class="ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-openSignupDialog js-nonNavigable u-textUserColor"]/text()')

print 'Followers: ', followers

But that returns an empty list.

(I know I don't need a list for a single value, but I'm working from existing code.)

Thanks for any pointers you can give.

Watty

Answer 1

>>> from lxml import etree
>>> import requests
>>> page = requests.get("https://twitter.com/aberdeencc")
>>> doc = etree.HTML(page.text)
>>> doc.xpath('//a[@data-nav="followers"]/@title')
['10,623 Followers']
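
The same idea can be tried offline against the markup quoted in the question, which also shows why the original query came back empty. This is only a sketch under a few assumptions: it parses the quoted snippet rather than the live page, the class list is shortened for readability, and it reads `data-original-title` because that is the attribute the snippet shows (the session above read the value from `title`).

```python
from lxml import html

# Markup taken from the question; the class list is shortened here.
SNIPPET = """
<a class="ProfileNav-stat ProfileNav-stat--link js-tooltip" data-nav="followers"
   href="/AberdeenCC/followers" data-original-title="10,623 Followers">
    <span class="ProfileNav-label">Followers</span>
    <span class="ProfileNav-value" data-is-compact="true">10.6K</span>
</a>
"""

tree = html.fromstring(SNIPPET)

# The question's query asks for a <span> carrying the <a> element's classes.
# No such <span> exists, so the result is an empty list.
span_query = tree.xpath(
    '//span[@class="ProfileNav-stat ProfileNav-stat--link js-tooltip"]/text()'
)

# Selecting the <a> by data-nav and reading an attribute finds the tooltip text.
attr_query = tree.xpath('//a[@data-nav="followers"]/@data-original-title')

# The abbreviated count is the text of the inner ProfileNav-value span.
compact = tree.xpath(
    '//a[@data-nav="followers"]/span[@class="ProfileNav-value"]/text()'
)

print(span_query)
print(attr_query)
print(compact)
```

The exact count is stored in an attribute of the `a` element, while the text nodes only hold the label and the rounded value, so a `/text()` query cannot return it.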

Answer 2

I'd advise against using XPath in this particular case. I think the CSS selector API is better suited here. This should work:

followers = tree.cssselect("a.ProfileNav-stat")[0].attrib["data-original-title"]
# followers = '10,623 Followers'

This method requires cssselect to be installed.

Answer 3

I'd rely on the data-nav attribute instead and get the value of the title attribute:

from lxml import html
import requests


page = requests.get('http://twitter.com/aberdeencc')
tree = html.fromstring(page.text)

followers = tree.xpath('//a[@data-nav="followers"]/@title')
print 'Followers: ', followers

Prints:

Followers:  ['10,623 Followers']

To extract the actual number from followers, you can use a regular expression and then parse the string to an int using locale.atoi():

import locale
import re
from lxml import html
import requests


locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

page = requests.get('http://twitter.com/aberdeencc')
tree = html.fromstring(page.text)

followers = tree.xpath('//a[@data-nav="followers"]/@title')[0]
followers = re.match(r'^([0-9,]+)\sFollowers$', followers).group(1)
followers = locale.atoi(followers)

print 'Followers:', int(followers)

Prints:

Followers: 10623
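
Note that `locale.setlocale()` raises `locale.Error` when the requested locale is not available on the system. Where that is a concern, one alternative is to strip the thousands separators by hand. The sketch below uses only the standard library and the same regular expression as above; `parse_follower_count` is a hypothetical helper name, and it assumes the title always has the form "<digits and commas> Followers".

```python
import re


def parse_follower_count(title):
    """Return the integer from a string such as '10,623 Followers'."""
    match = re.match(r'^([0-9,]+)\sFollowers$', title)
    if match is None:
        # Fail loudly if the page format changes instead of returning garbage.
        raise ValueError("unexpected title format: %r" % title)
    # Drop the comma separators before converting to int.
    return int(match.group(1).replace(',', ''))


print('Followers:', parse_follower_count('10,623 Followers'))
```

This avoids changing process-wide locale settings, at the cost of hard-coding the comma as the separator.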

Besides, Twitter provides an API which you can use through a Python interface; there are multiple options to choose from.

3 answers

Answer 1
0 2014-08-22 13:08:35

Answer 2
0 2014-08-22 13:10:44

Answer 3
0 (accepted) 2014-08-22 13:12:18
