Python - Selenium: Scraping AngularJS elements with a loop over find_elements_by()

I'm scraping real estate data. On sites generated with JavaScript, Selenium does a splendid job: you find the tags that hold the relevant information and loop over all of them with

driver.find_elements_by...

But on this site, the listings are produced by AngularJS. I tried the same approach:

for article in driver.find_elements_by_css_selector("div.property.ng-scope"):
    ...  # do something with each listing

I figured out that I have to make my webdriver (PhantomJS) click the link leading to each individual listing's page:

linkbase = article.find_element_by_css_selector("div.info.clear.ng-scope")
link = linkbase.find_element_by_tag_name('a')
link.click()

The webdriver is then pointed at that page, and I can get all the information I want for one listing.

As soon as the first pass through the loop ends, I get the following error:

> Message: {"errorMessage":"Element does not exist in cache","request":{"headers":
{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","
Content-Length":"142","Content-Type":"application/json;charset=UTF-8","Host":"12
7.0.0.1:56577","User-Agent":"Python-urllib/3.4"},"httpVersion":"1.1","method":"P
OST","post":"{\"sessionId\": \"f9ec2c10-dfd9-11e5-9d4c-3bbe8f5bf7c0\", \"using\"
: \"css selector\", \"id\": \":wdc:1456856343349\", \"value\": \"div.info.clear.
ng-scope\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"elemen
t","directory":"/","path":"/element","relative":"/element","port":"","host":"","
password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/ele
ment","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/f9ec2c10-dfd9-
11e5-9d4c-3bbe8f5bf7c0/element/:wdc:1456856343349/element"}}

The element containing the link on the page is:

<a ng-href="/detail/prodej/dum/rodinny/jemnice-jemnice-/3800125532" ng-click="beforeOpen(i.iterator, i.regionTip)" class="title" href="/detail/prodej/dum/rodinny/jemnice-jemnice-/3800125532">
<span class="name ng-binding"> ... </a>

This is just the title text of each listing. I did set a user-agent following this answer, even though it doesn't appear in the error. I also wait until the surrounding element is loaded:

wait = WebDriverWait(driver, getSearchResults_CZ.waiting)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.content")))

What I want is to parse all these property elements, save their links to a list, and then loop through the list, opening each link with driver.get(). I know that clicking the link changes the driver's URL, but I thought that once the list of articles had been established with find_elements_by, it would serve as a stable reference point. Accessing the link by searching for the "a" tag and calling get_attribute('href') didn't work in this case with the AngularJS framework. What am I not seeing?

EDIT: As answered, get_attribute without .click() is the right way to go. My original error was related to the CSS selector: I had been using "div[class^='property']" and got a totally different link. It must have matched another element I hadn't noticed before.
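The selector mix-up fits how the two forms differ: ".property" matches any element that has "property" as one of its class tokens, while "[class^='property']" only checks whether the raw class attribute string starts with "property". The sketch below emulates both rules in plain Python, without Selenium. The class values are hypothetical examples, not taken from the site.

```python
def matches_class_selector(class_attr, name):
    """Emulate `.name`: true if `name` is one of the whitespace-separated class tokens."""
    return name in class_attr.split()


def matches_prefix_selector(class_attr, prefix):
    """Emulate `[class^='prefix']`: true if the raw attribute string starts with `prefix`.

    An empty prefix matches nothing in CSS, so it is rejected here as well.
    """
    return prefix != "" and class_attr.startswith(prefix)


# Hypothetical class attribute values, for illustration only.
samples = ["property ng-scope", "property-list", "ng-scope property"]

for value in samples:
    print(
        repr(value),
        "| .property:", matches_class_selector(value, "property"),
        "| [class^='property']:", matches_prefix_selector(value, "property"),
    )
```

The prefix form would pick up an element with a class such as "property-list" and would miss one whose class attribute lists another class first, which is one way to end up with an unexpected link.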

Wait for at least one "property" to be visible and then grab the links:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://www.sreality.cz/hledani/prodej/domy?region=jemnice")
driver.maximize_window()

wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "property")))

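# find_elements(By.CSS_SELECTOR, ...) also works on Selenium 4, where the find_elements_by_* helpers were removed
links = [link.get_attribute("href") for link in driver.find_elements(By.CSS_SELECTOR, "div.property div.info a")]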
print(links)

driver.close()

Works for me.
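Once the hrefs are plain strings, the loop the question describes (save the links, then open each one with driver.get()) no longer depends on element references that go stale after navigation. Here is a minimal sketch of that pattern, assuming a driver that exposes find_elements() and get(). The function names and the parse_listing callback are hypothetical placeholders for whatever you extract from each detail page.

```python
def collect_listing_links(driver, selector="div.property div.info a"):
    """Return the href of every matching anchor as a plain string, without duplicates."""
    # "css selector" is the string value behind selenium's By.CSS_SELECTOR constant.
    anchors = driver.find_elements("css selector", selector)
    links = []
    for anchor in anchors:
        href = anchor.get_attribute("href")
        # Skip anchors without an href and links we have already seen.
        if href and href not in links:
            links.append(href)
    return links


def visit_listings(driver, links, parse_listing):
    """Open each saved URL with driver.get() and collect the parsed results.

    parse_listing is a hypothetical callback that receives the driver
    after the page has been requested and returns whatever data you need.
    """
    results = []
    for url in links:
        # Only strings are held across navigations, so nothing can go stale.
        driver.get(url)
        results.append(parse_listing(driver))
    return results
```

With the driver from the snippet above, this could be called as links = collect_listing_links(driver) followed by visit_listings(driver, links, my_parser), where my_parser is your own function. Each detail page may need its own explicit wait before parsing.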
