
Using Selenium to get Google search results without detection

I am trying to make a custom index-check utility, using Python and Selenium, to check which URLs have been indexed by Google.

I need to get the Google search results so that I can check whether the queried URL exists in the results or not. I am able to get 50 to 60 results before hitting Google's CAPTCHA.

Below is the relevant code:

from urllib.parse import urlencode

from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument('--headless')

driver = webdriver.Firefox(executable_path=r'./geckodriver', options=options)

urls = [line.strip() for line in open('urls.txt', 'r')]

url_search = "https://www.google.com/search?"

for c, link in enumerate(urls):

    # Build the search URL for this link.
    query = {'q': link}
    full_url = url_search + urlencode(query)

    driver.get(full_url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')

I've tried both ChromeDriver and geckodriver in headless mode, but got the same result.

My main concern is how can I use selenium without getting detected?

I know Google doesn't allow scraping, but there are some paid APIs that do exactly the same thing, i.e. provide Google search results. How do they work?

I've also searched for Google APIs but can't find one for my use case.

Also, if Google doesn't allow scraping, why does it let scrapers scrape a limited number of times?

Thanks for your time, I really appreciate it.

If a website doesn't want you to scrape it, there's usually little you can do, especially for the likes of Google or Amazon. In fact, it's also a matter of whether you should be doing it at all.

I know Google doesn't allow scraping, but there are some paid APIs that do exactly the same thing, i.e. provide Google search results. How do they work?

They use tools similar to the ones you're using, just at a larger scale. An example is multiple scraping agents in containers, each using a different proxy until it is detected. The agents then combine their findings and are restarted to scrape further.
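As a rough sketch of that idea (the proxy addresses and the failure handling below are made-up assumptions, not how any particular vendor works):

from itertools import cycle

import requests

# Hypothetical proxy pool -- real services rotate through much larger ones.
PROXIES = cycle([
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
])

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/96.0.4664.45 Safari/537.36"
}

def fetch(url, attempts=3):
    # Try a few proxies; rotate away from any that fail or get blocked.
    for _ in range(attempts):
        proxy = next(PROXIES)
        try:
            resp = requests.get(url, headers=headers,
                                proxies={'http': proxy, 'https': proxy},
                                timeout=10)
            # Google typically redirects blocked clients to a /sorry/ CAPTCHA page.
            if resp.ok and '/sorry/' not in resp.url:
                return resp.text
        except requests.RequestException:
            continue  # this proxy failed; move on to the next one
    return None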

Also, if Google doesn't allow scraping, why does it let scrapers scrape a limited number of times?

This can happen because it can take some time to tell for sure that a bot is being used. Moreover, you might need to scrape for a while before they decide that you're abusing their service.

However, there are a couple of things you could try. You could set a custom User-Agent with Selenium, and include this in your options: options.add_argument('--disable-blink-features=AutomationControlled') . The latter does miracles for some websites on Chrome with Selenium, but I am not sure if it's the same with Firefox.
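On Chrome, those two tweaks together would look roughly like this (the User-Agent string is just an example, not a required value):

from selenium import webdriver

options = webdriver.ChromeOptions()
# Present a regular desktop browser User-Agent; headless browsers often
# reveal themselves through their default one.
options.add_argument(
    'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36')
# Keeps Chrome from advertising automation (e.g. via navigator.webdriver).
options.add_argument('--disable-blink-features=AutomationControlled')

driver = webdriver.Chrome(options=options)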

There is not really that much you can do to bypass Google's CAPTCHA. You can try changing the User-Agent and some other properties. This article may help you.

To your last question: Google appears to have a search API that you can use for free (there is a paid plan as well). Here is a blog post about it.
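If that API is Google's Programmable Search Engine (the Custom Search JSON API), a minimal call looks something like the sketch below; the key and search-engine ID are placeholders you would create in the Google Cloud console, and the free tier has a small daily quota.

import os

import requests

# Placeholders: create these in the Google Cloud / Programmable Search consoles.
api_key = os.getenv('GOOGLE_API_KEY')
engine_id = os.getenv('SEARCH_ENGINE_ID')

resp = requests.get('https://www.googleapis.com/customsearch/v1',
                    params={'key': api_key, 'cx': engine_id, 'q': 'ice cream'})

for item in resp.json().get('items', []):
    print(item['title'], item['link'])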

You can use the requests and bs4 libraries instead of selenium, since everything in the Google search results is located in the HTML.

Make sure you're using a user-agent to fake a real user visit: with the requests library, the default user-agent is python-requests , which we need to avoid.

Let's say you want to scrape the title and the URL from that title; example in an online IDE:

from bs4 import BeautifulSoup
import requests, lxml

# Fake a real user visit.
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

# Search query.
params = {'q': 'ice cream'}

html = requests.get('https://www.google.com/search',
                    headers=headers,
                    params=params).text
soup = BeautifulSoup(html, 'lxml')

# select() uses CSS selectors. It's like findAll() or find_all(); you can iterate over it.
# If you want to scrape just one element, you can use the select_one() method instead.
for result in soup.select('.yuRUbf'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('a')['href']
    print(f'{title}\n{link}\n')
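Tying this back to the original index check (my own adaptation, assuming the same .yuRUbf markup and reusing the headers dict defined above):

def is_indexed(url):
    # Search for the exact URL and see whether it appears among the result links.
    html = requests.get('https://www.google.com/search',
                        headers=headers, params={'q': url}).text
    soup = BeautifulSoup(html, 'lxml')
    return any(url in a['href'] for a in soup.select('.yuRUbf a'))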

Alternatively, you can achieve these results using the Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.

Code to integrate, and an example in an online IDE:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "ice cream",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

# Iterates over JSON output and prints Title, Snippet (summary) and link on the new line
for result in results["organic_results"]:
  print(f"Title: {result['title']}\nSummary: {result['snippet']}\nLink: {result['link']}\n")

Disclaimer, I work for SerpApi.
