href attribute is empty when using XPath (Python 3)

Question

Using Chrome and XPath in Python 3, I am trying to extract the value of an "href" attribute on this web page. The "href" attribute contains the link to the movie's trailer ("bande-annonce" in French) that I am interested in.

First, when I use XPath, the "a" tag appears to be a "span" tag. In fact, using this code:

import urllib.request
from lxml import etree

# Fetch the raw HTML and list the children of the target div
response_main = urllib.request.urlopen("http://www.allocine.fr/film/fichefilm_gen_cfilm=231874.html")
htmlparser = etree.HTMLParser()
tree_main = etree.parse(response_main, htmlparser)
tree_main.xpath('//*[@id="content-start"]/article/section[3]/div[2]/div/div/div/div[1]/*')

I get this result:

[<Element span at 0x111f70c08>]

So the "div" tag contains no "a" tag, just a "span" tag. I've read that the HTML shown in a browser doesn't always reflect the "real" HTML sent by the server. So I tried this command to extract the href:

response_main = urllib.request.urlopen("http://www.allocine.fr/film/fichefilm_gen_cfilm=231874.html")
htmlparser = etree.HTMLParser()
tree_main = etree.parse(response_main, htmlparser)
tree_main.xpath('//*[@id="content-start"]/article/section[3]/div[2]/div/div/div/div[1]/span/@href')

Unfortunately, this returns nothing. And when I check the attributes of the "span" tag with this command:

tree_main.xpath('//*[@id=\"content-start\"]/article/section[3]/div[2]/div/div/div/div[1]/span/@*')

I get the value of the "class" attribute, but nothing about "href":

['ACrL3ZACrpZGVvL3BsYXllcl9nZW5fY21lZGlhPTE5NTYwMDcyJmNmaWxtPTIzMTg3NC5odG1s meta-title-link']

I'd like some help understanding what's happening here. Why is the "a" tag a "span" tag? And, most importantly for me, how can I extract the value of the "href" attribute?

Thanks a lot for your help!

Answer 1

The required link is generated dynamically with JavaScript. With urllib.request you only get the initial HTML page source, whereas you need the HTML after all the JavaScript has been executed.
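As a side note on the raw HTML: the "class" value quoted in the question looks like it carries the link target in encoded form. The sketch below is an observation on that single sample, not a documented behavior of the site. It assumes that "ACr" is a marker string inserted into an otherwise Base64-encoded path; both the marker and the helper name decode_obfuscated_class are assumptions made for illustration.

```python
import base64

# Assumption: "ACr" is an obfuscation marker, inferred only from the sample
# class value shown in the question. It is not a documented format.
MARKER = "ACr"


def decode_obfuscated_class(class_attr, marker=MARKER):
    """Return the path hidden in a class attribute, or None if none is found."""
    for token in class_attr.split():
        if not token.startswith(marker):
            continue  # ordinary CSS class such as "meta-title-link"
        payload = token.replace(marker, "")
        payload += "=" * (-len(payload) % 4)  # restore Base64 padding if missing
        try:
            return base64.b64decode(payload).decode("utf-8")
        except (ValueError, UnicodeDecodeError):
            continue  # token was not valid Base64 text after all
    return None


# The class value printed in the question
class_attr = ("ACrL3ZACrpZGVvL3BsYXllcl9nZW5fY21lZGlhPTE5NTYwMDcyJmNmaWxtPTIzMTg3NC5odG1s "
              "meta-title-link")
print(decode_obfuscated_class(class_attr))
# prints: /video/player_gen_cmedia=19560072&cfilm=231874.html
```

If this assumption holds for other pages, the decoded path could be joined to the site's base URL with urllib.parse.urljoin, which would avoid running JavaScript at all. Treat it as fragile, since the encoding may change at any time.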

You might use selenium + chromedriver to get the dynamically generated content:

from selenium import webdriver as web
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait 

driver = web.Chrome("/path/to/chromedriver")
driver.get("http://www.allocine.fr/film/fichefilm_gen_cfilm=231874.html")
link = wait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[@class='meta-title']/a[@class='xXx meta-title-link']")))
print(link.get_attribute('href'))
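The snippet above passes the driver path as a positional argument, which matches older Selenium releases. A minimal sketch for Selenium 4 follows, assuming the driver path is supplied through a Service object (or left out so that Selenium locates a driver itself). The function name get_trailer_href is hypothetical, and the XPath matches on the "meta-title-link" class seen in the question rather than on the full class string; whether the rendered page still has that structure is an assumption.

```python
def get_trailer_href(url, driver_path=None, timeout=10):
    """Open url in Chrome and return the href of the rendered title link.

    Hypothetical helper. Selenium is imported inside the function so that
    this file can be loaded even where Selenium is not installed.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # With no explicit path, Selenium 4 tries to find a suitable driver itself.
    service = Service(executable_path=driver_path) if driver_path else Service()
    driver = webdriver.Chrome(service=service)
    try:
        # Assumption: the rendered link keeps the "meta-title-link" class.
        locator = (By.XPATH,
                   "//div[@class='meta-title']/a[contains(@class, 'meta-title-link')]")
        link = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(locator))
        return link.get_attribute("href")
    finally:
        driver.quit()  # always close the browser, even if the wait times out


if __name__ == "__main__":
    print(get_trailer_href("http://www.allocine.fr/film/fichefilm_gen_cfilm=231874.html"))
```

Wrapping the call in try/finally ensures the browser is closed even when the element never appears and the wait raises a timeout.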

Answer 1: 2 2017-03-20 11:56:33
