
Selenium not able to find all elements in HTML page

I am scraping the real estate portal <www.immobiliare.it>

Specifically, I am retrieving some information from the search page, which contains 25 properties per page. I have managed to retrieve almost everything, but I am having trouble retrieving the src of a map image that each property has. This map image sits behind a CSS selector.

The HTML structure is the following: each `li class="nd-list__item in-realEstateResults__item"` is a property from which I want to extract the img src.

I have been able to get this data with Selenium: https://stackoverflow.com/a/75020969/14461986

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument('--headless')

driver = webdriver.Chrome(options=options, service=Service(ChromeDriverManager().install()))
url = 'https://www.immobiliare.it/vendita-case/milano/forlanini/?criterio=dataModifica&ordine=desc&page=3'
driver.get(url)

soup = BeautifulSoup(driver.page_source, 'html.parser')

data = []

# Each property is contained under each li in-realEstateResults__item
for property in soup.select('li.in-realEstateResults__item'):

    data.append({
            'id': property.get('id'),
            'MapUrl': property.select_one('[alt="mappa"]').get('src') if property.select_one('[alt="mappa"]') else None
        })

print(data)

However, after the 4th image the MapUrl comes back empty. The properties are correctly loaded, as I have checked the ids, and the HTML for the rest of the images is the same, but for a reason I do not understand the MapUrl is not retrieved. I would also welcome any advice on how to make this script more efficient.

However, the issue here is lazy loading, so you have to interact with the website and scroll down to force the images to load.

You may have to accept / close some popups first (optional):

driver.find_element(By.CSS_SELECTOR,'#didomi-notice-agree-button').click()
driver.find_element(By.CSS_SELECTOR,'.nd-dialogFrame__close').click()
driver.find_element(By.CSS_SELECTOR,'section h1').click()

Now we can start scrolling (a simple but working solution that could be improved):

for i in range(30):
    driver.find_element(By.CSS_SELECTOR, 'body').send_keys(Keys.PAGE_DOWN)
    time.sleep(0.3)
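Instead of a fixed number of PAGE_DOWN presses, a sketch of a more adaptive approach (not from the original answer, just an assumption about what "could be improved" might look like) is to scroll to the bottom repeatedly and stop once `document.body.scrollHeight` no longer grows, i.e. the lazy loader has stopped adding content:

```python
import time

def scroll_until_stable(driver, pause=0.3, max_rounds=30):
    """Scroll to the bottom until the document height stops growing,
    which suggests lazy-loaded content has finished appearing.
    Returns the number of scroll rounds performed."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)
        height = driver.execute_script("return document.body.scrollHeight")
        if height == last_height:
            # No new content appeared since the last scroll: stop early.
            return rounds
        last_height = height
    return max_rounds
```

Calling `scroll_until_stable(driver)` in place of the fixed loop avoids both under-scrolling a long page and wasting time on a short one.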

Example

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

url = 'https://www.immobiliare.it/vendita-case/milano/forlanini/?criterio=dataModifica&ordine=desc'
driver.get(url)


driver.find_element(By.CSS_SELECTOR,'#didomi-notice-agree-button').click()
driver.find_element(By.CSS_SELECTOR,'.nd-dialogFrame__close').click()
driver.find_element(By.CSS_SELECTOR,'section h1').click()

for i in range(30):
    driver.find_element(By.CSS_SELECTOR, 'body').send_keys(Keys.PAGE_DOWN)
    time.sleep(0.3)


soup = BeautifulSoup(driver.page_source, 'html.parser')

data = []
for e in soup.select('li.in-realEstateResults__item'):
    data.append({
        'title':e.a.get('title'),
        'imgUrls':[i.get('src') for i in e.select('.nd-list__item img')],
        'imgMapInfo': e.select_one('[alt="mappa"]').get('src') if e.select_one('[alt="mappa"]') else None
    })

print(data)
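The parsing step itself can be checked independently of the browser. Below is a minimal sketch running the same selectors against hypothetical inline HTML (the ids, title, and URL are made up for illustration); it also shows why the `if ... else None` guard matters when a map image has not been lazy-loaded yet:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal HTML mimicking two result cards, one with a
# loaded map image and one where lazy loading has not fired yet.
SAMPLE = """
<ul>
  <li class="nd-list__item in-realEstateResults__item" id="prop-1">
    <a title="Apartment in Forlanini"></a>
    <img alt="mappa" src="https://example.com/map1.png">
  </li>
  <li class="nd-list__item in-realEstateResults__item" id="prop-2">
    <a title="Loft in Forlanini"></a>
    <!-- map image not yet lazy-loaded -->
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE, 'html.parser')
data = []
for e in soup.select('li.in-realEstateResults__item'):
    img = e.select_one('[alt="mappa"]')  # select once, reuse the result
    data.append({
        'id': e.get('id'),
        'title': e.a.get('title') if e.a else None,
        'MapUrl': img.get('src') if img else None,
    })
print(data)
```

Storing `select_one(...)` in a variable instead of calling it twice per card is a small efficiency win on top of the correctness fix.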
