Scraping the contents of multiple pages of a website using BeautifulSoup and Selenium
Using Selenium and BeautifulSoup to scrape a dynamic web page, but new windows keep popping up
I am scraping content from a dynamic web page: https://www.nytimes.com/search?query=china+COVID-19 . I want to get the content of all the news articles (26,783 in total). I cannot iterate over pages, because on this site you have to click "Show More" to load the next page.
So I am using webdriver.ActionChains. The code shows no error message, but every few seconds a new window pops up, and each time it appears to be the same page. The process seems endless, and I interrupted it after 2 hours. I used `print(article)`, but nothing was printed. Could someone help me with this? Thanks a lot for your help!
import time
import requests
from bs4 import BeautifulSoup
import json
import string
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

# Initialize webdriver.Chrome and webdriver.ActionChains only once
chromedriver_path = 'C:/chromedriver.exe'
driver = webdriver.Chrome(chromedriver_path)
action = webdriver.ActionChains(driver)

# Get to the page
driver.get('https://www.nytimes.com/search?query=china+COVID-19')
soup = BeautifulSoup(driver.page_source, 'html.parser')

# While button is present
while soup.find('button', {'data-testid': 'search-show-more-button'}) is not None:
    # Find button
    button = driver.find_element_by_xpath('//button[@type="button"][contains(.,"Show More")]')
    # Move to it to avoid false-clicking other elements
    action.move_to_element(button).perform()
    # Click the button
    button.click()
    # Redefine variable 'soup' in case the button disappeared, so the 'while' loop will end
    soup = BeautifulSoup(driver.page_source, 'html.parser')

search_results = soup.find('ol', {'data-testid': 'search-results'})
links = search_results.find_all('a')
for link in links:
    link_url = link['href']
    response = requests.get(base + link_url)
    soup_link = BeautifulSoup(response.text, 'html.parser')
    scripts = soup_link.find_all('script')
    for script in scripts:
        if 'window.__preloadedData = ' in script.text:
            jsonStr = script.text
            jsonStr = jsonStr.split('window.__preloadedData = ')[-1]
            jsonStr = jsonStr.rsplit(';', 1)[0]
            jsonData = json.loads(jsonStr)
            article = []
            for k, v in jsonData['initialState'].items():
                try:
                    if v['__typename'] == 'TextInline':
                        article.append(v['text'])
                        # print(v['text'])
                except:
                    continue
            article = [each.strip() for each in article]
            article = ''.join([('' if c in string.punctuation else ' ') + c for c in article]).strip()
            print(article)
            myarticle.append(article)

df = pd.DataFrame(myarticle, columns=['article'])
df.to_csv('NYtimes.csv')
print("Complete")
browser.quit()
Output:
---------------------------------------------------------------------------
ElementClickInterceptedException Traceback (most recent call last)
<ipython-input-7-1515a65b3c60> in <module>
24 try:
---> 25 button.click()
26 break
~\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py in click(self)
79 """Clicks the element."""
---> 80 self._execute(Command.CLICK_ELEMENT)
81
~\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py in _execute(self, command, params)
632 params['id'] = self._id
--> 633 return self._parent.execute(command, params)
634
~\anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py in execute(self, driver_command, params)
320 if response:
--> 321 self.error_handler.check_response(response)
322 response['value'] = self._unwrap_value(
~\anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py in check_response(self, response)
241 raise exception_class(message, screen, stacktrace, alert_text)
--> 242 raise exception_class(message, screen, stacktrace)
243
ElementClickInterceptedException: Message: element click intercepted: Element <button data-testid="search-show-more-button" type="button">...</button> is not clickable at point (509, 656). Other element would receive the click: <div class="css-1n5jm1v">...</div>
(Session info: chrome=83.0.4103.61)
During handling of the above exception, another exception occurred:
NameError Traceback (most recent call last)
<ipython-input-7-1515a65b3c60> in <module>
25 button.click()
26 break
---> 27 except ElementClickInterceptedException:
28 time.sleep(0.5)
29 # Redefine variable 'soup' in case if button dissapeared, so the 'while' loop will end
NameError: name 'ElementClickInterceptedException' is not defined
彈出“新窗口”是因為您在每個循環的迭代中重新創建了驅動程序。
一步步。 首先,您在此處創建驅動程序並進入頁面:
browser = webdriver.Chrome('C:/chromedriver.exe')
browser.get('https://www.nytimes.com/search?query=china+COVID-19')
然后在循環內每次迭代創建一個驅動程序:
while True:
try:
driver = webdriver.Chrome('C:/chromedriver.exe')
driver.get('https://www.nytimes.com/search?query=china+COVID-19')
這就是為什么您每次都會看到新的 window 的原因。
要解決此問題,您可以應用此代碼(這僅包括迭代部分):
from selenium.common.exceptions import ElementClickInterceptedException
from selenium import webdriver
from bs4 import BeautifulSoup
import time

# Initialize webdriver.Chrome and webdriver.ActionChains only once
chromedriver_path = 'C:/chromedriver.exe'
driver = webdriver.Chrome(chromedriver_path)
action = webdriver.ActionChains(driver)

# Get to the page
driver.get('https://www.nytimes.com/search?query=china+COVID-19')
soup = BeautifulSoup(driver.page_source, 'html.parser')

# While the button is present
while soup.find('button', {'data-testid': 'search-show-more-button'}) is not None:
    # Find the button
    button = driver.find_element_by_xpath('//button[@type="button"][contains(.,"Show More")]')
    # Move to it to avoid false-clicking other elements
    action.move_to_element(button).perform()
    # The movement takes some time and is not instant, so it is better to add a short wait
    # to make sure that ElementClickInterceptedException won't appear
    time.sleep(0.5)
    # However, a constant sleep is not reliable if something unexpected happens and more
    # time is required, so let's create an endless loop that breaks once the click
    # succeeds. According to your last error, the 'covering element' was a 'div'.
    # In other words, even a false click won't trigger any action, which is why this
    # solution is safe.
    while True:
        try:
            button.click()
            break
        except ElementClickInterceptedException:
            time.sleep(0.5)
    # Redefine variable 'soup' in case the button disappeared, so the 'while' loop will end
    soup = BeautifulSoup(driver.page_source, 'html.parser')
As far as I can see, there is no structural problem with the second part, where you parse the search results, apart from a couple of undefined names (`base` and `myarticle` are never assigned before they are used), but if you run into anything, feel free to ask.
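For reference, the JSON-extraction step of that second part can be exercised on its own, without the network or the browser. The sketch below mirrors the question's split/rsplit/TextInline approach as a self-contained function; the `extract_article_text` name and the sample payload are made up for illustration (the real page's JSON is far larger):

```python
import json
import string

def extract_article_text(script_text):
    """Pull article text out of a 'window.__preloadedData = ...;' script tag,
    mirroring the question's approach: strip the assignment prefix and the
    trailing semicolon, parse the JSON, and collect every TextInline node."""
    json_str = script_text.split('window.__preloadedData = ')[-1]
    json_str = json_str.rsplit(';', 1)[0]
    data = json.loads(json_str)
    parts = []
    for value in data['initialState'].values():
        if isinstance(value, dict) and value.get('__typename') == 'TextInline':
            parts.append(value['text'].strip())
    # Join fragments with spaces, but glue punctuation to the preceding fragment
    return ''.join(('' if part in string.punctuation else ' ') + part
                   for part in parts).strip()

# Hypothetical payload shaped like the real page's script tag
sample = ('window.__preloadedData = '
          '{"initialState": {'
          '"a": {"__typename": "TextInline", "text": "Hello"}, '
          '"b": {"__typename": "TextInline", "text": "world"}, '
          '"c": {"__typename": "Other"}}};')
print(extract_article_text(sample))  # -> Hello world
```

Testing the parsing in isolation like this also makes the two missing names obvious: in the full script you still need something like `base = 'https://www.nytimes.com'` and `myarticle = []` before the link loop.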
UPD: It also makes no sense to initialize ActionChains on every iteration, so you can do it once, right after creating the webdriver. (I have updated the code sample accordingly, so you can simply copy it and read the comments for each step.)
UPD2: I added some extra protection against false clicks.
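The inner `while True` / `try` construction from UPD2 is a general retry pattern, and it can be tested without a browser. A minimal sketch, with a made-up helper name and a fake exception standing in for `ElementClickInterceptedException`:

```python
import time

def click_with_retry(action, retryable_exc, delay=0.5, max_attempts=20):
    """Invoke `action` until it stops raising `retryable_exc`, sleeping
    `delay` seconds between attempts. Returns the number of attempts it
    took; re-raises the last exception after `max_attempts` failures."""
    last_exc = None
    for attempt in range(1, max_attempts + 1):
        try:
            action()
            return attempt
        except retryable_exc as exc:
            last_exc = exc
            time.sleep(delay)
    raise last_exc

# Demo: a fake "button" whose click is intercepted twice before it succeeds
class FakeInterceptedError(Exception):
    pass

calls = {'count': 0}

def flaky_click():
    calls['count'] += 1
    if calls['count'] < 3:
        raise FakeInterceptedError('element click intercepted')

attempts = click_with_retry(flaky_click, FakeInterceptedError, delay=0.01)
print(attempts)  # -> 3
```

With the real driver, `click_with_retry(button.click, ElementClickInterceptedException)` would replace the inline loop, with the added safety of giving up after a bounded number of attempts instead of looping forever.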