Load More using Selenium on Webscraping
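I am trying to scrape Reuters for NLP analysis. Most of it works, but I cannot get the code to click the "LOAD MORE RESULTS" button so that more news articles are loaded. Below is the code I am currently using: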
import csv
import time
import pprint
from datetime import datetime, timedelta
import requests
import nltk
nltk.download('vader_lexicon')
from urllib.request import urlopen
from bs4 import BeautifulSoup
from bs4.element import Tag
comp_name = 'Apple'
url = 'https://www.reuters.com/search/news?blob=' + comp_name + '&sortBy=date&dateRange=all'
res = requests.get(url.format(1))
soup = BeautifulSoup(res.text,"lxml")
for item in soup.find_all("h3",{"class":"search-result-title"}):
    s = str(item)
    article_addr = s.partition('a href="')[2].partition('">')[0]
    headline = s.partition('a href="')[2].partition('">')[2].partition('</a></h3>')[0]
    article_link = 'https://www.reuters.com' + article_addr
    # Try the raw address first, then fall back to the absolute link
    try:
        resp = requests.get(article_addr)
    except Exception as e:
        try:
            resp = requests.get(article_link)
        except Exception as e:
            continue
    sauce = BeautifulSoup(resp.text,"lxml")
    dateTag = sauce.find("div",{"class":"ArticleHeader_date"})
    contentTag = sauce.find("div",{"class":"StandardArticleBody_body"})
    date = None
    title = None
    content = None
    if isinstance(dateTag,Tag):
        date = dateTag.get_text().partition('/')[0]
    if isinstance(contentTag,Tag):
        content = contentTag.get_text().strip()
    time.sleep(3)
    # content stays None when the body div is missing, so guard against that
    link_soup = BeautifulSoup(content or "", "lxml")
    sentences = link_soup.findAll("p")
    print(date, headline, article_link)
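from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import time
browser = webdriver.Safari()
browser.get('https://www.reuters.com/search/news?blob=' + comp_name + '&sortBy=date&dateRange=all')
try:
    # 'Id_Of_Element' is a placeholder, not a real ID on the page
    element = WebDriverWait(browser, 3).until(EC.presence_of_element_located((By.ID,'Id_Of_Element')))
except TimeoutException:
    print("Time out!")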
To click the element with the text LOAD MORE RESULTS, you need to induce WebDriverWait for element_to_be_clickable(), and you can use the following locator strategy:
Code block:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
# executable_path is the Selenium 3 form; Selenium 4 expects a Service object instead
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')

comp_name = 'Apple'
driver.get('https://www.reuters.com/search/news?blob=' + comp_name + '&sortBy=date&dateRange=all')
while True:
    try:
        # Scroll the button into view, then wait until it can be clicked
        driver.execute_script("return arguments[0].scrollIntoView(true);", WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='search-result-more-txt']"))))
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='search-result-more-txt']"))).click()
        print("LOAD MORE RESULTS button clicked")
    except TimeoutException:
        # The button no longer appears, so all results are loaded
        print("No more LOAD MORE RESULTS button to be clicked")
        break
driver.quit()
Console output:
LOAD MORE RESULTS button clicked
LOAD MORE RESULTS button clicked
LOAD MORE RESULTS button clicked
.
.
No more LOAD MORE RESULTS button to be clicked
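The "click until the wait times out" pattern in the loop above can be separated from Selenium so that it is easier to reuse and to test without a browser. The following is only a minimal sketch: `click_until_gone`, `wait_for_button`, `stop_on`, and `max_clicks` are hypothetical names introduced here for illustration and are not part of Selenium or of the answer above.

```python
from typing import Callable, Optional, Tuple, Type


def click_until_gone(
    wait_for_button: Callable[[], object],
    stop_on: Tuple[Type[BaseException], ...],
    max_clicks: Optional[int] = None,
) -> int:
    """Click the button returned by wait_for_button until it stops appearing.

    wait_for_button (hypothetical callback) should block until the button is
    clickable and return it, or raise one of the exceptions in stop_on when
    the wait times out. Returns the number of successful clicks.
    """
    clicks = 0
    while max_clicks is None or clicks < max_clicks:
        try:
            button = wait_for_button()
        except stop_on:
            break  # the button never became clickable: nothing left to load
        button.click()
        clicks += 1
    return clicks
```

With Selenium, the callback would presumably wrap the explicit wait from the answer, for example `click_until_gone(lambda: WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='search-result-more-txt']"))), stop_on=(TimeoutException,))`. Passing `max_clicks=10` would cap the number of clicks instead of loading every result.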
You can find a relevant detailed discussion in:
To click LOAD MORE RESULTS, induce WebDriverWait() and element_to_be_clickable().
Use a while loop and check counter < 11 to click 10 times.
I tested this on Chrome because I don't have the Safari browser, but it should work there as well.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

comp_name="Apple"
browser = webdriver.Chrome()
browser.get('https://www.reuters.com/search/news?blob=' + comp_name + '&sortBy=date&dateRange=all')
# Accept the terms button
WebDriverWait(browser,10).until(EC.element_to_be_clickable((By.CSS_SELECTOR,"button#_evidon-banner-acceptbutton"))).click()
i=1
while i<11:
    try:
        element = WebDriverWait(browser,10).until(EC.element_to_be_clickable((By.XPATH,"//div[@class='search-result-more-txt' and text()='LOAD MORE RESULTS']")))
        # Reading this property scrolls the element into view
        element.location_once_scrolled_into_view
        browser.execute_script("arguments[0].click();", element)
        print(i)
        i=i+1
    except TimeoutException:
        print("Time out!")
        break  # without this, a timeout would repeat forever because i never changes
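Once the extra results have been loaded, the question's original goal was to collect each headline and article link from the `h3` elements with class `search-result-title`. As an alternative to slicing the tag with `str.partition`, here is a minimal sketch that parses the rendered page with the standard library only. `SearchResultParser` and `extract_results` are hypothetical names introduced for illustration, and the markup layout (an `<a href>` inside each title `h3`) is assumed from the question's code rather than verified against the live site.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple
from urllib.parse import urljoin


class SearchResultParser(HTMLParser):
    """Collect (headline, url) pairs from <h3 class="search-result-title"><a href=...>."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.results: List[Tuple[str, str]] = []
        self._in_title = False             # currently inside a title <h3>
        self._href: Optional[str] = None   # href of the <a> being read, if any
        self._text: List[str] = []         # text fragments of the current headline

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h3" and "search-result-title" in (attrs.get("class") or "").split():
            self._in_title = True
        elif tag == "a" and self._in_title and attrs.get("href"):
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        # Nested tags such as <b> are skipped, but their text is still kept
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            headline = "".join(self._text).strip()
            # urljoin leaves absolute links alone and resolves relative ones
            self.results.append((headline, urljoin(self.base_url, self._href)))
            self._href = None
        elif tag == "h3":
            self._in_title = False


def extract_results(html: str, base_url: str = "https://www.reuters.com") -> List[Tuple[str, str]]:
    """Return every (headline, absolute_url) pair found in the given HTML."""
    parser = SearchResultParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.results
```

After the click loop finishes, calling `extract_results(browser.page_source)` should return the headlines and links for everything loaded so far, covering both relative and absolute article addresses in one step.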