
Issue with scraping URLs from a dynamic webpage with BeautifulSoup and Selenium

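This is my first time web scraping, and I can't scrape a list of URLs from the website. The code runs fine on Colaboratory when I replace the specified path with /usr/lib/chromium-browser/chromedriver, but when I try this code in an IDE...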

Just run Chrome in headed mode. In other words, don't use headless.

from bs4 import BeautifulSoup
from selenium import webdriver

# Headed (non-headless) Chrome: no '--headless' argument is passed.
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
wd = webdriver.Chrome(options=options)

courses = []
# range(1, 2) fetches only page 1; raise the upper bound to fetch more pages.
for i in range(1, 2):
    wd.get(f"https://www.sydney.edu.au/courses/search.html?search-type=course&page={i}")
    html_soup = BeautifulSoup(wd.page_source, "lxml")
    # Each search result is an <a> tag whose class attribute is exactly this string.
    for x in html_soup.find_all("a", class_="b-result-container__item-wrapper b-result-container__item-wrapper--data b-link--no-underline"):
        courses.append(x.get("href"))

wd.quit()  # close the browser once the pages have been parsed

for x in courses:
    print(x)

Output:

https://www.sydney.edu.au/courses/courses/uc/bachelor-of-arts.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-science.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-commerce.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-economics.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-psychology0.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-pharmacy.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-music.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-science-health.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-arts-honours.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-advanced-computing.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-oral-health.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-visual-arts.html
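
Because the result list is rendered by JavaScript, reading wd.page_source straight after wd.get() may race with the rendering on a slower machine. The following is only a minimal sketch of an explicit wait, not something the original code needed: the helper names css_selector and wait_for_result_links and the 10-second timeout are assumptions made for illustration, while the class names are the ones used in the code above.

```python
# Class names taken from the find_all() call in the answer's code.
RESULT_LINK_CLASSES = [
    "b-result-container__item-wrapper",
    "b-result-container__item-wrapper--data",
    "b-link--no-underline",
]


def css_selector(tag, class_names):
    """Build a CSS selector for `tag` elements carrying all of `class_names`."""
    return tag + "".join(f".{name}" for name in class_names)


def wait_for_result_links(wd, timeout=10):
    """Wait until result links are in the DOM, then return their href values.

    `wd` is an already-created Selenium WebDriver; `timeout` (seconds) is an
    arbitrary value chosen for this sketch.
    """
    # Imported lazily so css_selector() can be used without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    selector = css_selector("a", RESULT_LINK_CLASSES)
    # Blocks until at least one matching element is present,
    # or raises TimeoutException after `timeout` seconds.
    elements = WebDriverWait(wd, timeout).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, selector))
    )
    return [element.get_attribute("href") for element in elements]
```

With this in place, calling wait_for_result_links(wd) right after wd.get(...) would stand in for the BeautifulSoup loop. Unlike the class_ string match above, the selector matches the three classes in any order.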

You get this error because of the HeadlessChrome/89.0.4389.90 token in the User-Agent header. It is in the error traceback:

darkorange", source: https://www.sydney.edu.au/etc.clientlibs/courses/clientlibs/frontend-js.js (11714)
[0323/232203.250:INFO:CONSOLE(3)] "Hotjar not launching due to suspicious userAgent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/89.0.4389.90 Safari/537.36", source: https://static.hotjar.com/c/hotjar-550296.js?sv=6 (3)
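
If headless mode is still required (for example on a server without a display), one thing that might be worth trying is overriding the user agent so that it no longer contains the HeadlessChrome token. This is a hedged sketch rather than a confirmed fix: it assumes the site reacts only to the User-Agent string, which the log above suggests but does not prove. The helper names strip_headless_marker and make_headless_driver_with_plain_user_agent are made up for illustration.

```python
def strip_headless_marker(user_agent):
    """Return the user agent with the 'HeadlessChrome' token replaced by 'Chrome'."""
    return user_agent.replace("HeadlessChrome", "Chrome")


def make_headless_driver_with_plain_user_agent():
    """Start headless Chrome that reports a regular-looking user agent."""
    # Imported lazily so strip_headless_marker() works without Selenium installed.
    from selenium import webdriver

    # Start a throwaway headless session just to read the real user agent.
    probe_options = webdriver.ChromeOptions()
    probe_options.add_argument("--headless")
    probe = webdriver.Chrome(options=probe_options)
    try:
        headless_ua = probe.execute_script("return navigator.userAgent")
    finally:
        probe.quit()

    # Start the real session with the headless marker removed from the user agent.
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument(f"--user-agent={strip_headless_marker(headless_ua)}")
    return webdriver.Chrome(options=options)
```

Reading the user agent from a probe session keeps the version number in step with the installed Chrome instead of hard-coding it. Sites can detect headless browsers in other ways too, so running headed, as suggested above, remains the simpler option.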

