
How to scrape all the titles and links from Google search results (Python + Selenium)

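I am trying to scrape the titles and links from Google search results using Selenium (Python). My problem is that I can only scrape the first 4 results, not the other 6; for those, the result is simply empty. My feeling is that this has something to do with the loading time of the web page, but I am not sure. I have been looking into implementing a wait.until(EC.visibility_of_element_located statement, but have not found a way to make it work.

Does anyone have experience with this? Much appreciated!

Code: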

import urllib
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

root = "https://www.google.com/"
url  = "https://google.com/search?q="

query = 'Why do I only see the first 4 results?' # Fill in google query
query = urllib.parse.quote_plus(query)
link = url + query

print(f'Main link to search for: {link}')

options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=options)
driver.get(link)

WebDriverWait(driver, 10)
headings = driver.find_elements_by_xpath('//div[@class = "g"]') #Heading elements
   
for heading in headings:
    
    title = heading.find_elements_by_tag_name('h3')
    links = heading.get_attribute('href') # This ain't working either, any help?
    print(links)
    #link = heading.find_element_by_name('a href')
    for t in title:
         print('title:', t.text)

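You are trying to get only the div elements with class "g". However, looking at my own sample search results, I noticed that not every search result is an element of class g. Some are different.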

https://i.imgur.com/QNd6nPm.png

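You need a different kind of selector, for example iterating over the exact div that contains each "search result element" and keeping only the valid ones by checking each element for the attributes that match a normal search result.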
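One way to sketch that filtering idea is shown below. It assumes, based only on the markup in my screenshots, that each normal result has an a-tag wrapping an h3; Google's markup changes often, so treat the XPath as a guess. The names keep_organic and scrape_pairs are made up for this example.

```python
from urllib.parse import urlparse


def keep_organic(pairs):
    """Filter (title, href) pairs down to plausible normal results.

    Drops entries with an empty title or a non-http(s) link, and removes
    duplicate links while preserving order.
    """
    seen = set()
    kept = []
    for title, href in pairs:
        title = (title or "").strip()
        if not title or not href:
            continue
        if urlparse(href).scheme not in ("http", "https"):
            continue
        if href in seen:
            continue
        seen.add(href)
        kept.append((title, href))
    return kept


def scrape_pairs(driver):
    """Collect (title, href) for every <a> that wraps an <h3> on the page."""
    # Imported here so keep_organic() stays usable without Selenium installed.
    from selenium.webdriver.common.by import By

    pairs = []
    # Assumption: a normal result is an <a href=...> containing an <h3>.
    for anchor in driver.find_elements(By.XPATH, "//a[@href][.//h3]"):
        heading = anchor.find_element(By.XPATH, ".//h3")
        pairs.append((heading.text, anchor.get_attribute("href")))
    return pairs
```

With a driver that has already loaded the results page, keep_organic(scrape_pairs(driver)) would give the filtered list. Selecting on structure rather than on a class name is a design choice here: it avoids depending on class "g" being present on every result.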

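Edit:

Your attempt to get the link through the "href" attribute probably does not work because, in my case, the search results with class "g" do not have a direct href attribute themselves. Instead, each one contains an a-tag that carries the href attribute, like this: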

https://i.imgur.com/NHPcQTn.png

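Assuming the first a-tag inside a search result is always the one you are looking for, you can search the children of heading for the first a-tag and read the "href" attribute from it, something like this: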

href = heading.find_element_by_tag_name("a").get_attribute("href")

You specified the locator for the link incorrectly. Solution:

import urllib.parse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

root = "https://www.google.com/"
url = "https://google.com/search?q="

query = 'Why do I only see the first 4 results?'  # Fill in google query
query = urllib.parse.quote_plus(query)
link = url + query

print(f'Main link to search for: {link}')

options = Options()
# options.headless = True
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=options, executable_path='/snap/bin/chromium.chromedriver')
driver.get(link)

wait = WebDriverWait(driver, 15)
wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@class = "g"]')))
headings = driver.find_elements_by_xpath('//div[@class = "g"]')  # Heading elements

for heading in headings:

    title = heading.find_elements_by_tag_name('h3')
    links = heading.find_element_by_css_selector('.yuRUbf>a').get_attribute("href")  # The result link is the <a> directly inside .yuRUbf
    print(links)
    # link = heading.find_element_by_name('a href')
    for t in title:
        print('title:', t.text)

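Note that the only two things I fixed are:

1 The way the locator for the link is specified

2 The explicit wait. You were not using it the way it is meant to be used.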
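On the second point: WebDriverWait(driver, 10) on its own only creates a wait object; nothing is polled until .until(...) is called with a condition. The sketch below is a simplified stand-in for that kind of polling loop, not Selenium's actual implementation, and poll_until with its parameters is a hypothetical name used only for illustration.

```python
import time


def poll_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)


# Simulate a page whose results only show up on the third check.
attempts = []


def results_ready():
    attempts.append(1)
    return ["result"] * 10 if len(attempts) >= 3 else []


found = poll_until(results_ready, timeout=2.0, interval=0.01)
print(len(found), "results after", len(attempts), "polls")
# prints: 10 results after 3 polls
```

The point of the sketch is that the waiting happens inside the call that takes the condition. Creating the wait object and then calling find_elements straight away, as in the question, never gives the page a chance to finish rendering.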

Output:

 Main link to search for: https://google.com/search?q=Why+do+I+only+see+the+first+4+results%3F
   https://webapps.stackexchange.com/questions/14972/why-on-the-first-page-google-says-there-are-thousands-of-results-but-on-the-last
    title: Why on the first page Google says there are thousands of ...
    https://www.ltnow.com/how-to-get-more-than-10-results-per-page-in-google-search/
    title: 
    https://www.theleverageway.com/blog/how-far-down-the-search-engine-results-page-will-most-people-go/
    title: 
    https://www.impactplus.com/blog/google-is-limiting-number-of-search-results-per-domain-to-have-more-diversity-in-listings
    title: 
    https://www.forbes.com/sites/forbesagencycouncil/2017/10/30/the-value-of-search-results-rankings/
    title: 
    https://www.washingtonpost.com/news/the-intersect/wp/2015/06/30/always-click-the-first-google-result-you-might-want-to-stop-doing-that/
    title: Always click the first Google result? You might want to stop ...
    https://en.wikipedia.org/wiki/First_Four
    title: First Four - Wikipedia
    https://neilpatel.com/blog/first-page-google/
    title: How to Show Up on the First Page of Google (Even if You're a ...
    https://www.searchenginejournal.com/google-first-page-clicks/374516/
    title: Over 25% of People Click the First Google Search Result
    https://www.theleverageway.com/blog/how-far-down-the-search-engine-results-page-will-most-people-go/
    title: How Far Down the Search Results Page Will Most People Go?
    https://www.wordstream.com/blog/ws/2020/08/19/get-on-first-page-google
    title: 10+ Free Ways to Get on the First Page of Google | WordStream
    https://books.google.ca/books?id=teyaAwAAQBAJ&pg=PA102&lpg=PA102&dq=Why+do+I+only+see+the+first+4+results?&source=bl&ots=iBI-YaNJNc&sig=ACfU3U0GpAnPsH_zTbblyRv1C6eS5xwCUg&hl=en&sa=X&ved=2ahUKEwi-psHB_NjwAhURHs0KHWi-AC0Q6AEwD3oECBEQAw
    title: PISA Knowledge and Skills for Life First Results from PISA ...
    https://books.google.ca/books?id=8dY8AQAAQBAJ&pg=PA48&lpg=PA48&dq=Why+do+I+only+see+the+first+4+results?&source=bl&ots=x-7WRKNzXs&sig=ACfU3U13RRTc66oxnpWC6WW-CMwyyIAm8A&hl=en&sa=X&ved=2ahUKEwi-psHB_NjwAhURHs0KHWi-AC0Q6AEwEHoECA8QAw
    title: OECD Skills Outlook 2013 First Results from the Survey of ...
    https://books.google.ca/books?id=zWwVAQAAIAAJ&pg=PA22&lpg=PA22&dq=Why+do+I+only+see+the+first+4+results?&source=bl&ots=u7XMk6B6Qz&sig=ACfU3U2Q8kNocn8W3HHkFxxJnV0b58WYoA&hl=en&sa=X&ved=2ahUKEwi-psHB_NjwAhURHs0KHWi-AC0Q6AEwEXoECBAQAw
    title: Results of the First Joint US-USSR Central Pacific ...

The titles under "People also ask" are not returned because they have different locators.

