How to scrape all results from Google search result pages (Python/Selenium ChromeDriver)

Question

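I am writing a Python script with Selenium ChromeDriver to scrape every Google search result (link, header, text) from a specified number of result pages.

My code seems to scrape only the first result from each page after the first one. I think this has to do with how the for loop is set up in my scrape function, but I have not been able to adjust it to work the way I want. Any suggestions on how to fix this, or on a better approach, are appreciated.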

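from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# create instance of webdriver
driver = webdriver.Chrome()
url = 'https://www.google.com'
driver.get(url)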

# set keyword
keyword = 'cars' 
# we find the search bar using its name attribute value
searchBar = driver.find_element_by_name('q')
# first we send our keyword to the search bar, then a newline to submit it
searchBar.send_keys(keyword)
searchBar.send_keys('\n')

def scrape():
   pageInfo = []
   try:
      # wait for search results to be fetched
      WebDriverWait(driver, 10).until(
      EC.presence_of_element_located((By.CLASS_NAME, "g"))
      )
    
   except Exception as e:
      print(e)
      driver.quit()
   # contains the search results
   searchResults = driver.find_elements_by_class_name('g')
   for result in searchResults:
       element = result.find_element_by_css_selector('a')
       link = element.get_attribute('href')
       header = result.find_element_by_css_selector('h3').text
       text = result.find_element_by_class_name('IsZvec').text
       pageInfo.append({
           'header' : header, 'link' : link, 'text': text
       })
       return pageInfo

# Number of pages to scrape
numPages = 5
# All the scraped data
infoAll = []
# Scraped data from page 1
infoAll.extend(scrape())

for i in range(0 , numPages - 1):
   nextButton = driver.find_element_by_link_text('Next')
   nextButton.click()
   infoAll.extend(scrape())

print(infoAll)

Answer 1

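You have an indentation problem:

return pageInfo should be outside the for loop. As written, the function returns during the first iteration, so only the first result on each page is collected.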

for result in searchResults:
       element = result.find_element_by_css_selector('a')
       link = element.get_attribute('href')
       header = result.find_element_by_css_selector('h3').text
       text = result.find_element_by_class_name('IsZvec').text
       pageInfo.append({
           'header' : header, 'link' : link, 'text': text
       })
       return pageInfo

Like this:

for result in searchResults:
       element = result.find_element_by_css_selector('a')
       link = element.get_attribute('href')
       header = result.find_element_by_css_selector('h3').text
       text = result.find_element_by_class_name('IsZvec').text
       pageInfo.append({
           'header' : header, 'link' : link, 'text': text
       })
return pageInfo
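
To see why the position of the return statement matters, here is a minimal sketch that does not need Selenium. The function names and the sample list are made up for illustration and are not part of the original script.

```python
def collect_return_inside(items):
    collected = []
    for item in items:
        collected.append(item.upper())
        # Indented inside the loop: the function exits during the first iteration.
        return collected


def collect_return_outside(items):
    collected = []
    for item in items:
        collected.append(item.upper())
    # Indented at function level: runs once, after every item has been processed.
    return collected


sample = ['a', 'b', 'c']
print(collect_return_inside(sample))   # ['A']
print(collect_return_outside(sample))  # ['A', 'B', 'C']
```

Note that the first version also returns None when the list is empty, which would make a call such as infoAll.extend(scrape()) fail on a page with no matches.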

I ran your code with that change and got these results:

[{'header': 'Cars (film) — Wikipédia', 'link': 'https://fr.wikipedia.org/wiki/Cars_(film)', 'text': "Cars : Quatre Roues, ou Les Bagnoles au Québec (Cars), est le septième long-métrage d'animation entièrement en images de synthèse des studios Pixar.\nPays d'origine : États-Unis\nDurée : 116 minutes\nSociétés\nSociétés\nSociétés de Animation Studio\nCars 2 · Michel Fortin · Flash McQueen"}, {'header': 'Cars - Wikipedia, la enciclopedia libre', 'link': 'https://es.wikipedia.org/wiki/Cars', 'text': 'Cars es una película de animación por computadora de 2006, producida por Pixar Animation Studios y lanzada por Walt Disney Studios Motion Pictures.\nAño : 2006\nGénero : Animación; Aventura; Comedia; Infa...\nHistoria: John Lasseter Joe Ranft Jorgen Klubi...\nProductora: Walt Disney Pictures; Pixar Animation...'}, {'header': '', 'link': 'https://fr.wikipedia.org/wiki/Flash_McQueen', 'text': ''}, {'header': '', 'link': 'https://www.allocine.fr/film/fichefilm-55774/secrets-tournage/', 'text': ''}, {'header': '', 'link': 'https://fr.wikipedia.org/wiki/Martin_(Cars)', 'text': ''},
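
Several entries in that output have an empty header and text. If those are not wanted, one possible approach is to filter them out after scraping. The helper name drop_empty_results and the shortened sample data below are assumptions for illustration, and a check on the header alone may not suit every use case.

```python
def drop_empty_results(results):
    """Keep only entries whose 'header' contains visible text."""
    return [r for r in results if r.get('header', '').strip()]


# Shortened sample in the same shape as the dictionaries built by scrape().
sample = [
    {'header': 'Cars - Wikipedia, la enciclopedia libre',
     'link': 'https://es.wikipedia.org/wiki/Cars',
     'text': 'Cars es una película de animación por computadora de 2006'},
    {'header': '',
     'link': 'https://fr.wikipedia.org/wiki/Flash_McQueen',
     'text': ''},
]

filtered = drop_empty_results(sample)
print(len(filtered))        # 1
print(filtered[0]['link'])  # https://es.wikipedia.org/wiki/Cars
```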

Suggestion:

Add a delay to your for loop; otherwise Google may block you for suspicious activity.

Steps:
1. Import sleep: from time import sleep
2. Add a delay to the last loop:

for i in range(0 , numPages - 1):
    sleep(5) #It'll wait 5 seconds for each iteration
    nextButton = driver.find_element_by_link_text('Next')
    nextButton.click()
    infoAll.extend(scrape())
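
A slightly more defensive variant of this loop is sketched below. It uses a randomized pause instead of a fixed one and stops early when there is no further page. The function scrape_pages and its parameters are hypothetical names, not part of the original script, and the delay range is an arbitrary assumption rather than a value known to avoid blocking.

```python
import random
import time


def scrape_pages(scrape_page, go_to_next_page, num_pages,
                 min_delay=3.0, max_delay=7.0, sleep=time.sleep):
    """Collect results from up to num_pages pages, pausing between page loads.

    scrape_page() returns the list of results for the current page.
    go_to_next_page() moves to the next page and returns True,
    or returns False when there is no further page.
    """
    if num_pages < 1:
        return []
    results = list(scrape_page())
    for _ in range(num_pages - 1):
        # Randomized pause so requests are not sent at a fixed rhythm.
        sleep(random.uniform(min_delay, max_delay))
        if not go_to_next_page():
            break  # no further page, stop early
        results.extend(scrape_page())
    return results
```

With the script from the question, scrape_page would be the scrape function, and go_to_next_page would be a small wrapper that clicks the Next link and returns True, or returns False when the link cannot be found.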
