XPath always returns an empty list
I don't know why my title list (ts in the code below) always comes back empty.
I have tried changing the "User-Agent" header, but the result is the same.
Can anybody tell me what is wrong with my code?
The XPath looks right when I check it with the Chrome xpath-helper extension.
This is my code:
#coding=utf-8
import re
import urllib2
import urllib
from lxml import etree

def init():
    url = 'https://tieba.baidu.com/f?kw=%E7%BE%8E%E5%A5%B3&ie=utf-8&pn=0'
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request).read()
    print(1)
    print(response)
    # parse the response and extract the data
    get_title(response)
    print(4)

# get the href of every thread title
def get_title(response):
    # parse the HTML so it can be queried with XPath
    html_dom = etree.HTML(response)
    ts = html_dom.xpath('//div[@class="threadlist_lz clearfix"]/div/a[@class="j_th_tit"]/@href')
    print(2)
    print(ts)
    for href in ts:
        full_link='https://tieba.baidu.com'+str(href)
        print(3)
        print(full_link)
Result (I have trimmed part of the output because of the length limit):
1
<!DOCTYPE html>
<!--STATUS OK-->
<html>
...
<div class="threadlist_lz clearfix">
<div class="threadlist_title pull_left j_th_tit
">
<i class="icon-member-top" alt="会员置顶" title="会员置顶" ></i><i class="icon-good" alt="精品" title="精品" ></i>
<a rel="noreferrer" href="/p/5006374769" title="【答疑解惑】误删误封绿色通道" target="_blank" class="j_th_tit ">【答疑解惑】误删误封绿色通道</a>
</div><div class="threadlist_author pull_right">
...
2
[]
4
The @class value in your XPath expression is wrong. In the page source the attribute is class="j_th_tit " with a trailing space, and @class="..." is an exact string comparison. Change the value to "j_th_tit " (with the trailing space) and it will match:
//div[@class="threadlist_lz clearfix"]/div/a[@class="j_th_tit "]/@href
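A minimal sketch of the difference, assuming lxml is installed. SAMPLE_HTML is a hand-trimmed stand-in for the markup shown in the question's output (not the live page), and printing the repr of each @class value is one way to spot this kind of stray whitespace:

```python
from lxml import etree

# Trimmed stand-in for the relevant part of the page from the question.
SAMPLE_HTML = '''
<div class="threadlist_lz clearfix">
  <div class="threadlist_title pull_left j_th_tit ">
    <a rel="noreferrer" href="/p/5006374769" target="_blank" class="j_th_tit ">title</a>
  </div>
</div>
'''

dom = etree.HTML(SAMPLE_HTML)

# repr() makes trailing whitespace in the attribute value visible.
for value in dom.xpath('//a/@class'):
    print(repr(str(value)))

# Expression from the question: no trailing space, so nothing matches.
original = dom.xpath(
    '//div[@class="threadlist_lz clearfix"]/div/a[@class="j_th_tit"]/@href')

# Same expression with the trailing space: matches the link.
fixed = dom.xpath(
    '//div[@class="threadlist_lz clearfix"]/div/a[@class="j_th_tit "]/@href')

print(original)
print(fixed)
```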
To avoid errors like this, it is often better to use the contains(...) function, for example:
//div[contains(@class,"threadlist_lz") and contains(@class, "clearfix")]/div/a[contains(@class,"j_th_tit")]/@href
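This approach is less precise, because contains() is a plain substring test and also matches any class value that merely includes "j_th_tit" as part of a longer name, but it is sufficient most of the time.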
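To illustrate the trade-off, here is a small sketch, again assuming lxml. The second link's class, j_th_tit_extra, is a hypothetical name invented for this example to show an unwanted substring match. The last expression is a common XPath 1.0 idiom for matching a whole class token. It is an addition for comparison, not part of the approach above.

```python
from lxml import etree

# Hypothetical markup: the second class name is made up for illustration.
SAMPLE_HTML = '''
<div class="threadlist_lz clearfix">
  <div>
    <a href="/p/1" class="j_th_tit ">wanted link</a>
    <a href="/p/2" class="j_th_tit_extra">class that only shares a prefix</a>
  </div>
</div>
'''

dom = etree.HTML(SAMPLE_HTML)

# Substring test: tolerant of stray whitespace, but also matches /p/2.
loose = dom.xpath('//a[contains(@class, "j_th_tit")]/@href')

# Whole-token test: pad the normalized class list with spaces, then look
# for the class name surrounded by spaces.
token = dom.xpath(
    '//a[contains(concat(" ", normalize-space(@class), " "), " j_th_tit ")]/@href')

print(loose)
print(token)
```

The token form is more verbose, so the plain contains(...) version is usually the pragmatic choice unless the page has class names that overlap like this.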