How to scrape all the titles and links from Google search results (Python + Selenium)

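I am trying to scrape the titles and links from Google search results using Selenium (Python). My problem is that I can only scrape the first 4 results and not the other 6; for those, the result is simply empty. My feeling is that this has to do with the load time of the web page, but I'm not sure. I have been looking into a wait.until(EC.visibility_of_element_located(...)) statement, but have not found a way to make it work.

Does anyone have experience with this? Much appreciated!

Code: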

import urllib
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

root = "https://www.google.com/"
url  = "https://google.com/search?q="

query = 'Why do I only see the first 4 results?' # Fill in google query
query = urllib.parse.quote_plus(query)
link = url + query

print(f'Main link to search for: {link}')

options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=options)
driver.get(link)

WebDriverWait(driver, 10)
headings = driver.find_elements_by_xpath('//div[@class = "g"]') #Heading elements
   
for heading in headings:
    
    title = heading.find_elements_by_tag_name('h3')
    links = heading.get_attribute('href') # This ain't working either, any help?
    print(links)
    #link = heading.find_element_by_name('a href')
    for t in title:
         print('title:', t.text)

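You are trying to get only the div elements with class "g". However, looking at my own sample search results, I noticed that not every search result is an element with class "g". Some are different.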

https://i.imgur.com/QNd6nPm.png

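You need a different kind of selector, for example one that iterates over the exact div containing each "search result element" and keeps only the valid ones by checking each element's attributes against those of a normal search result.

Edit:

Your attempt to get the link via the "href" attribute probably does not work because, in my case, the search results with class "g" have no href attribute of their own. There is always an a-tag inside that carries the href attribute, like this: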

https://i.imgur.com/NHPcQTn.png

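Given that the first a-tag in a search result is always the one you are looking for, you can search the child elements of the heading for the first a-tag and then read the "href" attribute from it, something like this: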

href = heading.find_element_by_tag_name("a").get_attribute("href")
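The one-liner above raises NoSuchElementException when a container holds no a-tag at all, and the find_element_by_* helpers were, as far as I know, removed in Selenium 4.3. Below is a hedged sketch of the same idea using the find_elements(By, value) form, skipping containers that have no link. The function name extract_results is made up for illustration, and the fallback By class exists only so the helper can be tried without Selenium installed.

```python
try:
    from selenium.webdriver.common.by import By
except ImportError:
    # Fallback for reading/testing without Selenium installed (assumption:
    # these are the same string values that Selenium's By class defines).
    class By:
        XPATH = "xpath"
        TAG_NAME = "tag name"


def extract_results(driver):
    """Return (title, href) pairs for every result container that has a link.

    Hypothetical helper: `driver` is anything exposing
    find_elements(by, value), such as a Selenium WebDriver.
    """
    results = []
    for block in driver.find_elements(By.XPATH, '//div[@class="g"]'):
        anchors = block.find_elements(By.TAG_NAME, "a")
        if not anchors:
            continue  # no link inside: not a normal search result
        headings = block.find_elements(By.TAG_NAME, "h3")
        title = headings[0].text if headings else ""
        # The first a-tag is assumed to be the result link, as described above
        results.append((title, anchors[0].get_attribute("href")))
    return results
```

With a real driver this would be called as extract_results(driver) after driver.get(link). Using find_elements (plural) returns an empty list instead of raising, which is what makes the filtering possible.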

You specified the locator for the link incorrectly. Solution:

import urllib.parse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

root = "https://www.google.com/"
url = "https://google.com/search?q="

query = 'Why do I only see the first 4 results?'  # Fill in google query
query = urllib.parse.quote_plus(query)
link = url + query

print(f'Main link to search for: {link}')

options = Options()
# options.headless = True
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=options, executable_path='/snap/bin/chromium.chromedriver')
driver.get(link)

wait = WebDriverWait(driver, 15)
# Block until at least one result container is present in the DOM
wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@class = "g"]')))
headings = driver.find_elements(By.XPATH, '//div[@class = "g"]')  # Result containers

for heading in headings:

    title = heading.find_elements(By.TAG_NAME, 'h3')
    # The link is on the <a> directly inside the element with class "yuRUbf"
    links = heading.find_element(By.CSS_SELECTOR, '.yuRUbf>a').get_attribute("href")
    print(links)
    for t in title:
        print('title:', t.text)

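Note that the only two things I fixed are:

1 The way the locator is obtained.

2 Explicit waits. You were not using them the way they are meant to be used.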
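On the second point: WebDriverWait(driver, 10) on its own only builds a wait object, and nothing is actually waited for until .until(...) is called on it. Roughly speaking, until keeps polling a condition until it returns something truthy or the timeout runs out. The following is a plain-Python sketch of that idea and not Selenium's actual implementation; wait_until and its parameters are made-up names.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without a truthy result.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

In the solution above, the condition passed to wait.until plays the role of `condition` here: it looks up the result containers and only lets the script continue once they exist.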

Output:

 Main link to search for: https://google.com/search?q=Why+do+I+only+see+the+first+4+results%3F
   https://webapps.stackexchange.com/questions/14972/why-on-the-first-page-google-says-there-are-thousands-of-results-but-on-the-last
    title: Why on the first page Google says there are thousands of ...
    https://www.ltnow.com/how-to-get-more-than-10-results-per-page-in-google-search/
    title: 
    https://www.theleverageway.com/blog/how-far-down-the-search-engine-results-page-will-most-people-go/
    title: 
    https://www.impactplus.com/blog/google-is-limiting-number-of-search-results-per-domain-to-have-more-diversity-in-listings
    title: 
    https://www.forbes.com/sites/forbesagencycouncil/2017/10/30/the-value-of-search-results-rankings/
    title: 
    https://www.washingtonpost.com/news/the-intersect/wp/2015/06/30/always-click-the-first-google-result-you-might-want-to-stop-doing-that/
    title: Always click the first Google result? You might want to stop ...
    https://en.wikipedia.org/wiki/First_Four
    title: First Four - Wikipedia
    https://neilpatel.com/blog/first-page-google/
    title: How to Show Up on the First Page of Google (Even if You're a ...
    https://www.searchenginejournal.com/google-first-page-clicks/374516/
    title: Over 25% of People Click the First Google Search Result
    https://www.theleverageway.com/blog/how-far-down-the-search-engine-results-page-will-most-people-go/
    title: How Far Down the Search Results Page Will Most People Go?
    https://www.wordstream.com/blog/ws/2020/08/19/get-on-first-page-google
    title: 10+ Free Ways to Get on the First Page of Google | WordStream
    https://books.google.ca/books?id=teyaAwAAQBAJ&pg=PA102&lpg=PA102&dq=Why+do+I+only+see+the+first+4+results?&source=bl&ots=iBI-YaNJNc&sig=ACfU3U0GpAnPsH_zTbblyRv1C6eS5xwCUg&hl=en&sa=X&ved=2ahUKEwi-psHB_NjwAhURHs0KHWi-AC0Q6AEwD3oECBEQAw
    title: PISA Knowledge and Skills for Life First Results from PISA ...
    https://books.google.ca/books?id=8dY8AQAAQBAJ&pg=PA48&lpg=PA48&dq=Why+do+I+only+see+the+first+4+results?&source=bl&ots=x-7WRKNzXs&sig=ACfU3U13RRTc66oxnpWC6WW-CMwyyIAm8A&hl=en&sa=X&ved=2ahUKEwi-psHB_NjwAhURHs0KHWi-AC0Q6AEwEHoECA8QAw
    title: OECD Skills Outlook 2013 First Results from the Survey of ...
    https://books.google.ca/books?id=zWwVAQAAIAAJ&pg=PA22&lpg=PA22&dq=Why+do+I+only+see+the+first+4+results?&source=bl&ots=u7XMk6B6Qz&sig=ACfU3U2Q8kNocn8W3HHkFxxJnV0b58WYoA&hl=en&sa=X&ved=2ahUKEwi-psHB_NjwAhURHs0KHWi-AC0Q6AEwEXoECBAQAw
    title: Results of the First Joint US-USSR Central Pacific ...

The titles under "People also ask" are not returned because they have different locators.
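The output above also contains entries with a blank title and one URL that shows up twice. If only titled, unique results are wanted, a small post-processing step can drop the rest. This is a minimal plain-Python sketch; clean_results is a hypothetical helper and it assumes the scraped data has been collected as (title, href) pairs rather than printed.

```python
def clean_results(pairs):
    """Keep the first titled (title, href) pair per URL; skip blank titles."""
    seen = set()
    cleaned = []
    for title, href in pairs:
        title = title.strip()
        if not title or href in seen:
            continue
        seen.add(href)  # only URLs that were kept count as seen
        cleaned.append((title, href))
    return cleaned
```

Because an entry with a blank title is skipped without marking its URL as seen, a later entry for the same URL that does have a title is still kept.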
