How to scrape all the titles and links from Google search results (Python + Selenium)
I'm trying to scrape the titles and links from Google search results using Selenium (Python). My problem is that I can only scrape the first 4 results, not the other 6; for those, the output is just empty.
My feeling is that this might have something to do with the loading time of the web page, but I'm not sure. I have been looking at implementing the
wait.until(EC.visibility_of_element_located
statement, but haven't found a way of making it work.
Does anyone have experience with this issue? Much appreciated!
Code:
import urllib
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
root = "https://www.google.com/"
url = "https://google.com/search?q="
query = 'Why do I only see the first 4 results?' # Fill in google query
query = urllib.parse.quote_plus(query)
link = url + query
print(f'Main link to search for: {link}')
options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=options)
driver.get(link)
WebDriverWait(driver, 10)
headings = driver.find_elements_by_xpath('//div[@class = "g"]') #Heading elements
for heading in headings:
    title = heading.find_elements_by_tag_name('h3')
    links = heading.get_attribute('href') # This ain't working either, any help?
    print(links)
    #link = heading.find_element_by_name('a href')
    for t in title:
        print('title:', t.text)
You're only selecting div elements with the class "g". However, looking at a sample search result page myself, I noticed that not every search result is an element of the class g. Some differ.
https://i.imgur.com/QNd6nPm.png
You need a different kind of selector, e.g. iterate through the exact div that contains every "search-result-element" and filter the valid ones by checking whether each element's attributes match those of a normal search result.
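The filtering idea could be sketched roughly as follows. This is only an illustration: the helper names `is_search_result` and `filter_search_results` are hypothetical, and the rule "a normal result has an h3 with text and an a-tag with an href" is an assumption, not something taken from Google's markup. The locator strings are the values of Selenium's `By.TAG_NAME` and `By.CSS_SELECTOR`, so the functions should accept real WebElement objects.

```python
# Locator strategy strings; these equal selenium's By.TAG_NAME and By.CSS_SELECTOR.
TAG_NAME = "tag name"
CSS_SELECTOR = "css selector"


def is_search_result(block):
    """Return True if a candidate block looks like a normal search result.

    Assumption (not taken from Google's markup): a normal result contains
    at least one <h3> with non-blank text and at least one <a> with an href.
    """
    titles = [h3 for h3 in block.find_elements(TAG_NAME, "h3") if h3.text.strip()]
    anchors = block.find_elements(CSS_SELECTOR, "a[href]")
    return bool(titles) and bool(anchors)


def filter_search_results(blocks):
    """Keep only the candidate blocks that pass is_search_result."""
    return [block for block in blocks if is_search_result(block)]
```

The candidate blocks would come from a broader locator than `//div[@class = "g"]`, for example the children of whatever container holds all results on the page you inspect.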
EDIT:
Your attempt to get the link via the attribute "href" probably doesn't work either because, in my case, search results with the class "g" don't have a direct href attribute. There is always an a-tag inside them that carries the href attribute, like so:
https://i.imgur.com/NHPcQTn.png
Assuming the first a-tag in a search result is always the one you're looking for, you could search through the sub-elements of your heading for the first a-tag that's found and then get the "href" attribute from it, something like this:
href = heading.find_element_by_tag_name("a").get_attribute("href")
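Note that the `find_element_by_*` helpers were removed in Selenium 4.3; on newer versions the same lookup is written as `heading.find_element(By.TAG_NAME, "a")`. A slightly more defensive variant might look like the sketch below. The helper name `first_href` is hypothetical, and it uses `find_elements` (plural) so that a block without any a-tag yields `None` instead of raising `NoSuchElementException`.

```python
TAG_NAME = "tag name"  # same value as selenium.webdriver.common.by.By.TAG_NAME


def first_href(element):
    """Return the href of the first <a> inside element that has one, else None."""
    for anchor in element.find_elements(TAG_NAME, "a"):
        href = anchor.get_attribute("href")
        if href:
            return href
    return None
```

Inside the loop from the question this would be used as `links = first_href(heading)`.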
You specified the locators for the links incorrectly. Solution:
import urllib
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
root = "https://www.google.com/"
url = "https://google.com/search?q="
query = 'Why do I only see the first 4 results?' # Fill in google query
query = urllib.parse.quote_plus(query)
link = url + query
print(f'Main link to search for: {link}')
options = Options()
# options.headless = True
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=options, executable_path='/snap/bin/chromium.chromedriver')
driver.get(link)
wait = WebDriverWait(driver, 15)
wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@class = "g"]')))
headings = driver.find_elements_by_xpath('//div[@class = "g"]') # Heading elements
for heading in headings:
    title = heading.find_elements_by_tag_name('h3')
    links = heading.find_element_by_css_selector('.yuRUbf>a').get_attribute("href") # This ain't working either, any help?
    print(links)
    # link = heading.find_element_by_name('a href')
    for t in title:
        print('title:', t.text)
Please note that the only 2 things I fixed were:
1 The way you get the locator
2 Explicit waits. You did not use them the way they should be used.
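On the second point: `WebDriverWait(driver, 10)` on its own only creates a wait object; nothing is actually waited for until `.until(...)` is called with a condition. The polling idea behind it can be illustrated in plain Python. This is a simplified sketch, not Selenium's code: `wait_until` and `WaitTimeout` are hypothetical names, whereas the real `WebDriverWait.until` passes the driver to the condition, ignores `NoSuchElementException` by default, and raises `TimeoutException`.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (stand-in for TimeoutException)."""


def wait_until(condition, timeout=10.0, poll_frequency=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # The truthy value is handed back to the caller, as `.until` does.
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll_frequency)
```

This is why `wait.until(EC.presence_of_all_elements_located(...))` in the solution blocks until the result elements exist, while the bare `WebDriverWait(driver, 10)` line in the question returns immediately.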
Output of the corrected script:
Main link to search for: https://google.com/search?q=Why+do+I+only+see+the+first+4+results%3F
https://webapps.stackexchange.com/questions/14972/why-on-the-first-page-google-says-there-are-thousands-of-results-but-on-the-last
title: Why on the first page Google says there are thousands of ...
https://www.ltnow.com/how-to-get-more-than-10-results-per-page-in-google-search/
title:
https://www.theleverageway.com/blog/how-far-down-the-search-engine-results-page-will-most-people-go/
title:
https://www.impactplus.com/blog/google-is-limiting-number-of-search-results-per-domain-to-have-more-diversity-in-listings
title:
https://www.forbes.com/sites/forbesagencycouncil/2017/10/30/the-value-of-search-results-rankings/
title:
https://www.washingtonpost.com/news/the-intersect/wp/2015/06/30/always-click-the-first-google-result-you-might-want-to-stop-doing-that/
title: Always click the first Google result? You might want to stop ...
https://en.wikipedia.org/wiki/First_Four
title: First Four - Wikipedia
https://neilpatel.com/blog/first-page-google/
title: How to Show Up on the First Page of Google (Even if You're a ...
https://www.searchenginejournal.com/google-first-page-clicks/374516/
title: Over 25% of People Click the First Google Search Result
https://www.theleverageway.com/blog/how-far-down-the-search-engine-results-page-will-most-people-go/
title: How Far Down the Search Results Page Will Most People Go?
https://www.wordstream.com/blog/ws/2020/08/19/get-on-first-page-google
title: 10+ Free Ways to Get on the First Page of Google | WordStream
https://books.google.ca/books?id=teyaAwAAQBAJ&pg=PA102&lpg=PA102&dq=Why+do+I+only+see+the+first+4+results?&source=bl&ots=iBI-YaNJNc&sig=ACfU3U0GpAnPsH_zTbblyRv1C6eS5xwCUg&hl=en&sa=X&ved=2ahUKEwi-psHB_NjwAhURHs0KHWi-AC0Q6AEwD3oECBEQAw
title: PISA Knowledge and Skills for Life First Results from PISA ...
https://books.google.ca/books?id=8dY8AQAAQBAJ&pg=PA48&lpg=PA48&dq=Why+do+I+only+see+the+first+4+results?&source=bl&ots=x-7WRKNzXs&sig=ACfU3U13RRTc66oxnpWC6WW-CMwyyIAm8A&hl=en&sa=X&ved=2ahUKEwi-psHB_NjwAhURHs0KHWi-AC0Q6AEwEHoECA8QAw
title: OECD Skills Outlook 2013 First Results from the Survey of ...
https://books.google.ca/books?id=zWwVAQAAIAAJ&pg=PA22&lpg=PA22&dq=Why+do+I+only+see+the+first+4+results?&source=bl&ots=u7XMk6B6Qz&sig=ACfU3U2Q8kNocn8W3HHkFxxJnV0b58WYoA&hl=en&sa=X&ved=2ahUKEwi-psHB_NjwAhURHs0KHWi-AC0Q6AEwEXoECBAQAw
title: Results of the First Joint US-USSR Central Pacific ...
Titles for "People also ask" entries are not returned because they have a different locator.
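Separately, when an h3 is found but its `.text` comes back empty, one possible cause is that the element is not displayed, since `.text` only returns rendered text. Whether that applies to the empty titles above is an assumption. A hedged fallback, with the hypothetical helper name `title_text`, could read the DOM's `textContent` instead:

```python
def title_text(h3):
    """Return the visible text of an element, falling back to its textContent."""
    text = (h3.text or "").strip()
    if text:
        return text
    # textContent is read from the DOM, so it is also available for hidden elements.
    return (h3.get_attribute("textContent") or "").strip()
```

In the loop this would replace `t.text` with `title_text(t)`.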