
Using Selenium and BeautifulSoup to scrape a dynamic web page, but new pages keep popping up

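I am scraping content from a dynamic web page: https://www.nytimes.com/search?query=china+COVID-19 . I want to get the content of all the news articles (26,783 in total). I cannot iterate over pages, because on this site you have to click "Show More" to load the next page.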

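So I am using webdriver.ActionChains. The code does not show any error message, but a new window pops up every few seconds, and it looks like the same page each time. The process seemed endless, and I interrupted it after 2 hours. I added print(article) to the code, but nothing was printed. Can someone help me solve this problem? Thank you very much for your help!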

import time
import requests
from bs4 import BeautifulSoup
import json
import string
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
# Initialize webdriver.Chrome and webdriver.ActionChains only once
chromedriver_path = 'C:/chromedriver.exe'
driver = webdriver.Chrome(chromedriver_path)
action = webdriver.ActionChains(driver)

# Get to the page
driver.get('https://www.nytimes.com/search?query=china+COVID-19')
soup = BeautifulSoup(driver.page_source, 'html.parser')

# While button is present
while soup.find('button', {'data-testid': 'search-show-more-button'}) != None:
    # Find button
    button = driver.find_element_by_xpath('//button[@type="button"][contains(.,"Show More")]')
    # Move to it to avoid false-clicking other elements
    action.move_to_element(button).perform()
    # Click the button
    button.click()
    # Re-parse the page into 'soup' so the 'while' loop ends once the button has disappeared
    soup = BeautifulSoup(driver.page_source, 'html.parser')


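search_results = soup.find('ol', {'data-testid':'search-results'})

# 'base' and 'myarticle' were used below without being defined.
# Assumption: the result links are site-relative, so they are prefixed with the site root.
base = 'https://www.nytimes.com'
myarticle = []

links = search_results.find_all('a')
for link in links:
    link_url = link['href']

    response = requests.get(base + link_url)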
    soup_link = BeautifulSoup(response.text, 'html.parser')
    scripts = soup_link.find_all('script')
    for script in scripts:
        if 'window.__preloadedData = ' in script.text:
            jsonStr = script.text
            jsonStr = jsonStr.split('window.__preloadedData = ')[-1]
            jsonStr = jsonStr.rsplit(';',1)[0]

            jsonData = json.loads(jsonStr)

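            # Collect the text of every 'TextInline' node
            article = []
            for k, v in jsonData['initialState'].items():
                try:
                    if v['__typename'] == 'TextInline':
                        article.append(v['text'])
                except (KeyError, TypeError):
                    # Skip entries that are not dicts or lack these keys
                    continue
            article = [ each.strip() for each in article ]
            # Join the fragments with a space, except before fragments that are punctuation
            article = ''.join([('' if c in string.punctuation else ' ')+c for c in article]).strip()
            print(article)
            myarticle.append(article)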


df = pd.DataFrame(myarticle, columns = ['article'])

df.to_csv('NYtimes.csv')

print("Complete")

driver.quit()

Output:

---------------------------------------------------------------------------
ElementClickInterceptedException          Traceback (most recent call last)
<ipython-input-7-1515a65b3c60> in <module>
     24         try:
---> 25             button.click()
     26             break

~\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py in click(self)
     79         """Clicks the element."""
---> 80         self._execute(Command.CLICK_ELEMENT)
     81 

~\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py in _execute(self, command, params)
    632         params['id'] = self._id
--> 633         return self._parent.execute(command, params)
    634 

~\anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py in execute(self, driver_command, params)
    320         if response:
--> 321             self.error_handler.check_response(response)
    322             response['value'] = self._unwrap_value(

~\anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py in check_response(self, response)
    241             raise exception_class(message, screen, stacktrace, alert_text)
--> 242         raise exception_class(message, screen, stacktrace)
    243 

ElementClickInterceptedException: Message: element click intercepted: Element <button data-testid="search-show-more-button" type="button">...</button> is not clickable at point (509, 656). Other element would receive the click: <div class="css-1n5jm1v">...</div>
  (Session info: chrome=83.0.4103.61)


During handling of the above exception, another exception occurred:

NameError                                 Traceback (most recent call last)
<ipython-input-7-1515a65b3c60> in <module>
     25             button.click()
     26             break
---> 27         except ElementClickInterceptedException:
     28             time.sleep(0.5)
     29     # Redefine variable 'soup' in case if button dissapeared, so the 'while' loop will end

NameError: name 'ElementClickInterceptedException' is not defined

The "new windows" pop up because you re-create the driver on every iteration of the loop.

Step by step. First, you create the driver here and open the page:

browser = webdriver.Chrome('C:/chromedriver.exe')
browser.get('https://www.nytimes.com/search?query=china+COVID-19')

Then, inside the loop, you create another driver on every iteration:

while True:
    try:
        driver = webdriver.Chrome('C:/chromedriver.exe')
        driver.get('https://www.nytimes.com/search?query=china+COVID-19')

That is why you see a new window every time.

To fix this, you can use the following code (it covers only the iteration part):

from selenium.common.exceptions import ElementClickInterceptedException
from selenium import webdriver
from bs4 import BeautifulSoup
import time

# Initialize webdriver.Chrome and webdriver.ActionChains only once
chromedriver_path = 'C:/chromedriver.exe'
driver = webdriver.Chrome(chromedriver_path)
action = webdriver.ActionChains(driver)

# Get to the page
driver.get('https://www.nytimes.com/search?query=china+COVID-19')
soup = BeautifulSoup(driver.page_source, 'html.parser')

# While button is present
while soup.find('button', {'data-testid': 'search-show-more-button'}) != None:
    # Find button
    button = driver.find_element_by_xpath('//button[@type="button"][contains(.,"Show More")]')
    # Move to it to avoid false-clicking other elements
    action.move_to_element(button).perform()
    # The movement is not instant, so add a short wait to make it less likely
    # that ElementClickInterceptedException is raised
    time.sleep(0.5)
    # However, a fixed sleep is not reliable if something unexpected happens and more
    # time is required, so let's create an endless loop that breaks once the
    # 'click' succeeds. According to your last error, the 'covering element' was a 'div'.
    # In other words, even a false click won't trigger any action, which is why this
    # solution is safe.
    while True:
        try:
            button.click()
            break
        except ElementClickInterceptedException:
            time.sleep(0.5)
    # Re-parse the page into 'soup' so the 'while' loop ends once the button has disappeared
    soup = BeautifulSoup(driver.page_source, 'html.parser')

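As far as I can see, there is nothing wrong with the second part, where you parse the search results, but if you run into problems there, feel free to ask.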
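For reference, the JSON extraction in that second part can also be written without the split/rsplit pair. The sketch below is an alternative, not the code from the question: it uses json.JSONDecoder.raw_decode from the standard library, which parses one JSON value and ignores whatever follows it (such as the trailing semicolon). The helper names extract_preloaded_data and collect_text are hypothetical, and the 'initialState' / 'TextInline' layout is taken from the question's code, so it may not match what the site actually serves.

```python
import json

MARKER = "window.__preloadedData = "


def extract_preloaded_data(script_text):
    """Return the object assigned to window.__preloadedData, or None if absent."""
    if MARKER not in script_text:
        return None
    payload = script_text.split(MARKER, 1)[1].lstrip()
    # raw_decode parses a single JSON value starting at index 0 and
    # returns (value, end_index); the text after end_index is ignored.
    data, _end = json.JSONDecoder().raw_decode(payload)
    return data


def collect_text(initial_state):
    """Collect the stripped 'text' of every entry typed as 'TextInline'."""
    parts = []
    for value in initial_state.values():
        if not isinstance(value, dict):
            continue
        text = value.get("text")
        if value.get("__typename") == "TextInline" and isinstance(text, str):
            parts.append(text.strip())
    return parts
```

With the question's variables this would be called as data = extract_preloaded_data(script.text), followed by collect_text(data['initialState']) when data is not None.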

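UPD: Initializing ActionChains on every iteration is also pointless, so you can do it once, right after creating the webdriver. (I have changed the code sample, so you can simply copy it and read the comments for each step.)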

UPD2: I added some extra protection against false clicks.
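One caveat about that protection: the inner 'while True' loop never gives up, so if the covering element never goes away, the script hangs. A minimal sketch of a bounded variant is shown below. click_with_retry and its parameters are hypothetical names (not part of Selenium), and the default limits are arbitrary.

```python
import time


def click_with_retry(click, retry_on, max_attempts=20, delay=0.5):
    """Call click() until it succeeds or max_attempts is reached.

    click    -- zero-argument callable, e.g. button.click
    retry_on -- exception class (or tuple of classes) that means "try again"
    Returns the number of attempts used; re-raises the last error on failure.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            click()
            return attempt
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: let the caller see the original exception
                raise
            time.sleep(delay)
```

In the loop above it would replace the inner 'while True' block with click_with_retry(button.click, ElementClickInterceptedException).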

