Scraping Headlines From News Website Homepages Using BeautifulSoup in Python
Scraping headlines from a news website with infinite loading
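I want to scrape headlines from this website: https://www.marketwatch.com/latest-news?mod=top_nav
I need to load older news, so clicking the blue "See More" button is necessary.
I wrote this code, but it did not work: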
from bs4 import BeautifulSoup
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
u = 'https://www.marketwatch.com/latest-news?mod=top_nav' #US Business
driver = webdriver.Chrome(executable_path=r"C:/chromedriver.exe")
driver.maximize_window()
driver.get(u)
time.sleep(10)
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CLASS_NAME,'close-btn'))).click()
time.sleep(10)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
for i in range(3):
    element = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'component.component--module.more-headlines div.group.group--buttons.cover > a.btn.btn--secondary.js--more-headlines')))
    driver.execute_script("arguments[0].scrollIntoView();", element)
    element.click()
    time.sleep(5)
    driver.execute_script("arguments[0].scrollIntoView();", element)
    print(f'click {i} done')
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
It returns this error:
raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Something like this would be more reliable:
for i in range(3):
    driver.execute_script('''
        document.querySelector('a.js--more-headlines').click()
    ''')
    time.sleep(1)
Note that when you click from JavaScript, you don't have to scroll the element into view.
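One caveat with the JavaScript click: if the button is missing (for example, it has not rendered yet or there is nothing more to load), document.querySelector returns null and calling .click() on it raises a JavaScript error. Below is a minimal sketch of a guarded version. The helper name click_more_headlines and its parameters are hypothetical, and it assumes the button still matches the a.js--more-headlines selector; it relies only on the driver's execute_script method.

```python
import time

# Selector taken from the snippet above; whether it still matches
# the live page is an assumption.
MORE_BUTTON_SELECTOR = "a.js--more-headlines"

# JavaScript run in the page: click the button if present and
# report back whether a click happened.
CLICK_SCRIPT = """
    const button = document.querySelector(arguments[0]);
    if (!button) { return false; }
    button.click();
    return true;
"""


def click_more_headlines(driver, clicks=3, pause=1.0,
                         selector=MORE_BUTTON_SELECTOR):
    """Click the "See More" button via JavaScript up to `clicks` times.

    Stops early if the button is not found. Returns the number of
    clicks actually performed. (Hypothetical helper, not part of Selenium.)
    """
    done = 0
    for _ in range(clicks):
        # execute_script passes `selector` to the page as arguments[0]
        # and returns the script's return value to Python.
        if not driver.execute_script(CLICK_SCRIPT, selector):
            break
        done += 1
        time.sleep(pause)  # give the page time to append more headlines
    return done
```

With the driver from the question, this could be called as click_more_headlines(driver, clicks=3) before reading driver.page_source. A return value lower than the requested number of clicks suggests the button was not found, which is easier to diagnose than a bare TimeoutException.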