Scraping the contents of multiple pages of a website using BeautifulSoup and Selenium
Using Selenium and BeautifulSoup to scrape a dynamic web page, but new windows keep popping up
I am scraping content from a dynamic web page: https://www.nytimes.com/search?query=china+COVID-19 . I want to get the content of all the news articles (26,783 in total). I cannot iterate over pages, because on this site you have to click "Show More" to load the next page.
So I am using webdriver.ActionChains. The code shows no error message, but every few seconds a new window pops up, and each time it appears to be the same page. The process seems endless, and I interrupted it after 2 hours. I used `print(article)`, but nothing was printed. Could someone help me with this? Thanks a lot for your help!
import time
import requests
from bs4 import BeautifulSoup
import json
import string
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

# Initialize webdriver.Chrome and webdriver.ActionChains only once
chromedriver_path = 'C:/chromedriver.exe'
driver = webdriver.Chrome(chromedriver_path)
action = webdriver.ActionChains(driver)

# Get to the page
driver.get('https://www.nytimes.com/search?query=china+COVID-19')
soup = BeautifulSoup(driver.page_source, 'html.parser')

# While button is present
while soup.find('button', {'data-testid': 'search-show-more-button'}) is not None:
    # Find button
    button = driver.find_element_by_xpath('//button[@type="button"][contains(.,"Show More")]')
    # Move to it to avoid false-clicking other elements
    action.move_to_element(button).perform()
    # Click the button
    button.click()
    # Redefine variable 'soup' in case the button disappeared, so the 'while' loop will end
    soup = BeautifulSoup(driver.page_source, 'html.parser')

search_results = soup.find('ol', {'data-testid': 'search-results'})
links = search_results.find_all('a')
for link in links:
    link_url = link['href']
    response = requests.get(base + link_url)
    soup_link = BeautifulSoup(response.text, 'html.parser')
    scripts = soup_link.find_all('script')
    for script in scripts:
        if 'window.__preloadedData = ' in script.text:
            jsonStr = script.text
            jsonStr = jsonStr.split('window.__preloadedData = ')[-1]
            jsonStr = jsonStr.rsplit(';', 1)[0]
            jsonData = json.loads(jsonStr)
            article = []
            for k, v in jsonData['initialState'].items():
                try:
                    if v['__typename'] == 'TextInline':
                        article.append(v['text'])
                        # print(v['text'])
                except:
                    continue
            article = [each.strip() for each in article]
            article = ''.join([('' if c in string.punctuation else ' ') + c for c in article]).strip()
            print(article)
            myarticle.append(article)

df = pd.DataFrame(myarticle, columns=['article'])
df.to_csv('NYtimes.csv')
print("Complete")
browser.quit()
Output:
---------------------------------------------------------------------------
ElementClickInterceptedException Traceback (most recent call last)
<ipython-input-7-1515a65b3c60> in <module>
24 try:
---> 25 button.click()
26 break
~\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py in click(self)
79 """Clicks the element."""
---> 80 self._execute(Command.CLICK_ELEMENT)
81
~\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py in _execute(self, command, params)
632 params['id'] = self._id
--> 633 return self._parent.execute(command, params)
634
~\anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py in execute(self, driver_command, params)
320 if response:
--> 321 self.error_handler.check_response(response)
322 response['value'] = self._unwrap_value(
~\anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py in check_response(self, response)
241 raise exception_class(message, screen, stacktrace, alert_text)
--> 242 raise exception_class(message, screen, stacktrace)
243
ElementClickInterceptedException: Message: element click intercepted: Element <button data-testid="search-show-more-button" type="button">...</button> is not clickable at point (509, 656). Other element would receive the click: <div class="css-1n5jm1v">...</div>
(Session info: chrome=83.0.4103.61)
During handling of the above exception, another exception occurred:
NameError Traceback (most recent call last)
<ipython-input-7-1515a65b3c60> in <module>
25 button.click()
26 break
---> 27 except ElementClickInterceptedException:
28 time.sleep(0.5)
29 # Redefine variable 'soup' in case if button dissapeared, so the 'while' loop will end
NameError: name 'ElementClickInterceptedException' is not defined
彈出“新窗口”是因為您在每個循環的迭代中重新創建了驅動程序。
一步步。 首先,您在此處創建驅動程序並進入頁面:
browser = webdriver.Chrome('C:/chromedriver.exe')
browser.get('https://www.nytimes.com/search?query=china+COVID-19')
然后在循環內每次迭代創建一個驅動程序:
while True:
try:
driver = webdriver.Chrome('C:/chromedriver.exe')
driver.get('https://www.nytimes.com/search?query=china+COVID-19')
這就是為什么您每次都會看到新的 window 的原因。
要解決此問題,您可以應用此代碼(這僅包括迭代部分):
from selenium.common.exceptions import ElementClickInterceptedException
from selenium import webdriver
from bs4 import BeautifulSoup
import time

# Initialize webdriver.Chrome and webdriver.ActionChains only once
chromedriver_path = 'C:/chromedriver.exe'
driver = webdriver.Chrome(chromedriver_path)
action = webdriver.ActionChains(driver)

# Get to the page
driver.get('https://www.nytimes.com/search?query=china+COVID-19')
soup = BeautifulSoup(driver.page_source, 'html.parser')

# While the button is present
while soup.find('button', {'data-testid': 'search-show-more-button'}) is not None:
    # Find the button
    button = driver.find_element_by_xpath('//button[@type="button"][contains(.,"Show More")]')
    # Move to it to avoid false-clicking other elements
    action.move_to_element(button).perform()
    # The movement takes some time and is not instant, so it is better to add a short wait
    # to make sure that ElementClickInterceptedException won't appear
    time.sleep(0.5)
    # However, a constant sleep is not reliable if something unexpected happens and more
    # time is required, so let's create an endless loop that breaks once the click
    # succeeds. According to your last error, the 'covering element' was a 'div'.
    # In other words, even a false click won't trigger any action, which is why this
    # solution is safe.
    while True:
        try:
            button.click()
            break
        except ElementClickInterceptedException:
            time.sleep(0.5)
    # Redefine variable 'soup' in case the button disappeared, so the 'while' loop will end
    soup = BeautifulSoup(driver.page_source, 'html.parser')
As far as I can see, there is no structural problem with the second part, where you parse the search results, apart from a couple of undefined names (`base` and `myarticle` are never assigned before they are used), but if you run into anything, feel free to ask.
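For reference, the JSON-extraction step of that second part can be exercised on its own, without the network or the browser. The sketch below mirrors the question's split/rsplit/TextInline approach as a self-contained function; the `extract_article_text` name and the sample payload are made up for illustration (the real page's JSON is far larger):

```python
import json
import string

def extract_article_text(script_text):
    """Pull article text out of a 'window.__preloadedData = ...;' script tag,
    mirroring the question's approach: strip the assignment prefix and the
    trailing semicolon, parse the JSON, and collect every TextInline node."""
    json_str = script_text.split('window.__preloadedData = ')[-1]
    json_str = json_str.rsplit(';', 1)[0]
    data = json.loads(json_str)
    parts = []
    for value in data['initialState'].values():
        if isinstance(value, dict) and value.get('__typename') == 'TextInline':
            parts.append(value['text'].strip())
    # Join fragments with spaces, but glue punctuation to the preceding fragment
    return ''.join(('' if part in string.punctuation else ' ') + part
                   for part in parts).strip()

# Hypothetical payload shaped like the real page's script tag
sample = ('window.__preloadedData = '
          '{"initialState": {'
          '"a": {"__typename": "TextInline", "text": "Hello"}, '
          '"b": {"__typename": "TextInline", "text": "world"}, '
          '"c": {"__typename": "Other"}}};')
print(extract_article_text(sample))  # -> Hello world
```

Testing the parsing in isolation like this also makes the two missing names obvious: in the full script you still need something like `base = 'https://www.nytimes.com'` and `myarticle = []` before the link loop.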
UPD: It also makes no sense to initialize ActionChains on every iteration, so you can do it once, right after creating the webdriver. (I have updated the code sample accordingly, so you can simply copy it and read the comments for each step.)
UPD2: I added some extra protection against false clicks.
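The inner `while True` / `try` construction from UPD2 is a general retry pattern, and it can be tested without a browser. A minimal sketch, with a made-up helper name and a fake exception standing in for `ElementClickInterceptedException`:

```python
import time

def click_with_retry(action, retryable_exc, delay=0.5, max_attempts=20):
    """Invoke `action` until it stops raising `retryable_exc`, sleeping
    `delay` seconds between attempts. Returns the number of attempts it
    took; re-raises the last exception after `max_attempts` failures."""
    last_exc = None
    for attempt in range(1, max_attempts + 1):
        try:
            action()
            return attempt
        except retryable_exc as exc:
            last_exc = exc
            time.sleep(delay)
    raise last_exc

# Demo: a fake "button" whose click is intercepted twice before it succeeds
class FakeInterceptedError(Exception):
    pass

calls = {'count': 0}

def flaky_click():
    calls['count'] += 1
    if calls['count'] < 3:
        raise FakeInterceptedError('element click intercepted')

attempts = click_with_retry(flaky_click, FakeInterceptedError, delay=0.01)
print(attempts)  # -> 3
```

With the real driver, `click_with_retry(button.click, ElementClickInterceptedException)` would replace the inline loop, with the added safety of giving up after a bounded number of attempts instead of looping forever.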