
Get all the links from a webpage

On some websites (in particular this one: https://www.sothebys.com/en/results ), more links become available as you scroll down the page; however, I cannot extract them with BeautifulSoup or Selenium.

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument("--test-type")
options.binary_location = "/usr/bin/chromium"
# Pass the options to the driver so the chromium binary location takes effect
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)

driver.get("https://www.sothebys.com/en/results")
urls = []
for a in driver.find_elements_by_xpath('.//a'):
    urls.append(a.get_attribute('href'))
print(urls)
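For context, a plain requests + BeautifulSoup fetch shows the same limitation: it only sees the anchors in the initial HTML, not the results injected by JavaScript as you scroll (a minimal sketch, assuming the requests and bs4 packages are installed):

import requests
from bs4 import BeautifulSoup

# The static HTML returned by the server does not include the results
# that are injected later by JavaScript as you scroll.
response = requests.get("https://www.sothebys.com/en/results")
soup = BeautifulSoup(response.text, "html.parser")
urls = [a.get("href") for a in soup.find_all("a", href=True)]
print(len(urls))  # only the links present in the initial page source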

Could you please help me with the code or suggest what I should be doing?

Something to get you started:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Chrome()

driver.get("https://www.sothebys.com/en/results")

# Number of links initially present in the DOM
print(len(driver.find_elements_by_xpath('.//a')))

# Scroll to the bottom of the page so more results are loaded
driver.find_element_by_xpath('.//body').send_keys(Keys.END)
time.sleep(2)  # give the page a moment to load the new results

print(len(driver.find_elements_by_xpath('.//a')))

# Scroll again; the link count keeps growing
driver.find_element_by_xpath('.//body').send_keys(Keys.END)
time.sleep(2)

print(len(driver.find_elements_by_xpath('.//a')))

driver.close()  # close the browser

Note that the number of links increases each time we scroll to the bottom of the page via driver.find_element_by_xpath('.//body').send_keys(Keys.END).
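If you want to collect all the links rather than just watch the count grow, one approach is to keep scrolling until the count stops changing and only then read the href attributes. A minimal sketch, using the Selenium 4 find_elements(By.XPATH, ...) API and assuming a fixed 2-second pause is enough for each batch of results to load:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Chrome()
driver.get("https://www.sothebys.com/en/results")

# Keep scrolling to the bottom until no new links appear.
# The 2-second pause is an assumption; adjust it to the page's load time.
previous_count = 0
while True:
    driver.find_element(By.XPATH, './/body').send_keys(Keys.END)
    time.sleep(2)
    current_count = len(driver.find_elements(By.XPATH, './/a'))
    if current_count == previous_count:
        break
    previous_count = current_count

# Collect the href of every anchor that is now in the DOM
urls = [a.get_attribute('href') for a in driver.find_elements(By.XPATH, './/a')]
print(len(urls))

driver.quit()

A more robust variant would use WebDriverWait instead of fixed sleeps, but the loop above is usually enough for pages that simply append results on scroll.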
