
Selenium not able to find all elements in HTML page

I am scraping the real estate portal <www.immobiliare.it>

Specifically, I am retrieving some information from the search page, which contains 25 properties per page. I have managed to retrieve almost everything, but I am having trouble retrieving the src of a map image that each property has. This map image sits behind a CSS selector.

The HTML structure is the following: each `li class="nd-list__item in-realEstateResults__item"` is a property from which I want to extract the img src.

I have been able to get this data with Selenium: https://stackoverflow.com/a/75020969/14461986

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument('--headless')

driver = webdriver.Chrome(options=options, service=Service(ChromeDriverManager().install()))
url = 'https://www.immobiliare.it/vendita-case/milano/forlanini/?criterio=dataModifica&ordine=desc&page=3'
driver.get(url)

soup = BeautifulSoup(driver.page_source, 'html.parser')

data = []

# Each property is contained under each li in-realEstateResults__item
for property in soup.select('li.in-realEstateResults__item'):

    data.append({
            'id': property.get('id'),
            'MapUrl': property.select_one('[alt="mappa"]').get('src') if property.select_one('[alt="mappa"]') else None
        })

print(data)

However, after the 4th image the MapUrl comes back empty. The properties are correctly loaded, as I have checked the ids, and the HTML for the rest of the images is the same, but for a reason I do not understand the MapUrl is not retrieved. I would also welcome any advice on how to make this script more efficient.

However, the issue here is lazy loading, so you have to interact with the website and scroll down to force the images to load.

You may have to accept / close some popups first (optional):

driver.find_element(By.CSS_SELECTOR,'#didomi-notice-agree-button').click()
driver.find_element(By.CSS_SELECTOR,'.nd-dialogFrame__close').click()
driver.find_element(By.CSS_SELECTOR,'section h1').click()

Now we can start scrolling (a simple but working solution that could be improved):

for i in range(30):
    driver.find_element(By.CSS_SELECTOR, 'body').send_keys(Keys.PAGE_DOWN)
    time.sleep(0.3)
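Instead of a fixed number of PAGE_DOWN presses, a sketch of a more adaptive approach (not from the original answer, just an assumption about what "could be improved" might look like) is to scroll to the bottom repeatedly and stop once `document.body.scrollHeight` no longer grows, i.e. the lazy loader has stopped adding content:

```python
import time

def scroll_until_stable(driver, pause=0.3, max_rounds=30):
    """Scroll to the bottom until the document height stops growing,
    which suggests lazy-loaded content has finished appearing.
    Returns the number of scroll rounds performed."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)
        height = driver.execute_script("return document.body.scrollHeight")
        if height == last_height:
            # No new content appeared since the last scroll: stop early.
            return rounds
        last_height = height
    return max_rounds
```

Calling `scroll_until_stable(driver)` in place of the fixed loop avoids both under-scrolling a long page and wasting time on a short one.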

Example

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

url = 'https://www.immobiliare.it/vendita-case/milano/forlanini/?criterio=dataModifica&ordine=desc'
driver.get(url)


driver.find_element(By.CSS_SELECTOR,'#didomi-notice-agree-button').click()
driver.find_element(By.CSS_SELECTOR,'.nd-dialogFrame__close').click()
driver.find_element(By.CSS_SELECTOR,'section h1').click()

for i in range(30):
    driver.find_element(By.CSS_SELECTOR, 'body').send_keys(Keys.PAGE_DOWN)
    time.sleep(0.3)


soup = BeautifulSoup(driver.page_source, 'html.parser')

data = []
for e in soup.select('li.in-realEstateResults__item'):
    data.append({
        'title':e.a.get('title'),
        'imgUrls':[i.get('src') for i in e.select('.nd-list__item img')],
        'imgMapInfo': e.select_one('[alt="mappa"]').get('src') if e.select_one('[alt="mappa"]') else None
    })

print(data)
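The parsing step itself can be checked independently of the browser. Below is a minimal sketch running the same selectors against hypothetical inline HTML (the ids, title, and URL are made up for illustration); it also shows why the `if ... else None` guard matters when a map image has not been lazy-loaded yet:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal HTML mimicking two result cards, one with a
# loaded map image and one where lazy loading has not fired yet.
SAMPLE = """
<ul>
  <li class="nd-list__item in-realEstateResults__item" id="prop-1">
    <a title="Apartment in Forlanini"></a>
    <img alt="mappa" src="https://example.com/map1.png">
  </li>
  <li class="nd-list__item in-realEstateResults__item" id="prop-2">
    <a title="Loft in Forlanini"></a>
    <!-- map image not yet lazy-loaded -->
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE, 'html.parser')
data = []
for e in soup.select('li.in-realEstateResults__item'):
    img = e.select_one('[alt="mappa"]')  # select once, reuse the result
    data.append({
        'id': e.get('id'),
        'title': e.a.get('title') if e.a else None,
        'MapUrl': img.get('src') if img else None,
    })
print(data)
```

Storing `select_one(...)` in a variable instead of calling it twice per card is a small efficiency win on top of the correctness fix.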
