
How to scale up scraping of search results (currently using requests and bs4 in Python)

I wrote some Python code using requests to try to build a database of search result links:

from bs4 import BeautifulSoup
import requests
import re

for i in range(0, 1000, 20):
    url = "https://www.google.com/search?q=inurl%3Agedt.html&ie=utf-8&start=" + i.__str__() + "0&num=20"
    page = requests.get(url)
    if i == 0:
        soup = BeautifulSoup(page.content)
    else:
        soup.append(BeautifulSoup(page.content))

links = soup.findAll("a")

clean_links = []
for link in soup.find_all("a",href=re.compile("(?<=/url\?q=)(htt.*://.*)")):
    print(re.split(":(?=http)",link["href"].replace("/url?q=","")))
    clean_links.append(re.split(":(?=http)", link["href"].replace("/url?q=", "")))

However, after only 40 results Google suspected me of being a robot and quit providing results. That's their prerogative, but is there a (legitimate) way of getting around this?

Can I have some sort of authentication in requests / bs4 and, if so, is there some kind of account that lets me pay them for the privilege of scraping all 10-20,000 results?

There are several steps to bypass blocking:

  1. Make sure you're sending a user-agent request header so the request looks like a visit from a "real" user, because the default requests user-agent is python-requests and websites understand that it's most likely a script sending the request. Check what your user-agent is. Using a user agent is more reliable, but only up to a certain point.
  2. Having one user-agent is not enough, but you can rotate several of them to make requests a bit more reliable (see the sketch after this list).
  3. Sometimes passing only a user-agent isn't enough. You can pass additional headers. See more HTTP request headers that you can send while making a request.
  4. The most reliable way to bypass blocking is residential proxies. Residential proxies allow you to choose a specific location (country, city, or mobile carrier) and surf the web as a real user in that area. Proxies can be described as intermediaries that shield users from general web traffic: they act as buffers while also concealing your IP address.
  5. Using non-overused proxies is the best option. You can scrape a lot of public proxies and save them to a list(), or save them to a .txt file to save memory, then iterate over them while making requests to see what the results are, and move on to a different type of proxy if the results aren't what you were looking for (the sketch after this list also shows how proxies are passed to requests).
  6. You can get whitelisted. Being whitelisted means your IP addresses are added to a website's allow list, which explicitly allows certain identified entities to access a particular privilege, i.e. a list of things that are allowed when everything is denied by default. One way to become whitelisted is to regularly do something useful for "them" based on the scraped data, which could lead to some insights.
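
Here's a minimal sketch of points 1, 2, 4, and 5 combined, rotating user agents and proxies with requests. The user-agent strings are just examples, and the proxy addresses are placeholders you would replace with proxies you actually have access to.

import random
import requests

# A small pool of user-agent strings to rotate; in practice you would keep a
# larger, regularly updated list.
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.5 Safari/605.1.15",
]

# Placeholder proxy URLs: replace these with residential or other proxies you control.
proxy_pool = [
    "http://user:password@proxy1.example.com:8080",
    "http://user:password@proxy2.example.com:8080",
]

def fetch(url, params=None):
    headers = {"User-Agent": random.choice(user_agents)}   # points 1-2: rotate user agents
    proxy = random.choice(proxy_pool)                      # points 4-5: rotate proxies
    # requests takes a dict mapping scheme -> proxy URL
    return requests.get(url, params=params, headers=headers,
                        proxies={"http": proxy, "https": proxy}, timeout=30)

# This call will only succeed once real proxy addresses are filled in above.
response = fetch("https://www.google.com/search", params={"q": "inurl:gedt.html"})
print(response.status_code)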

For more information on how to bypass blocking, you can read the Reducing the chance of being blocked while web scraping blog post.

You can also check the response with status_code. If a bad request was made (a 4XX client error or a 5XX server error), it can be raised with Response.raise_for_status(). But if the status code of the request is 200 and we call raise_for_status(), we get None, which means there were no errors and everything is fine.

html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

if html.status_code == 200:
    # the rest of the code
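
As a small illustration of the raise_for_status() behavior described above (using httpbin.org as a stand-in test endpoint):

import requests

ok = requests.get("https://httpbin.org/status/200", timeout=30)
print(ok.raise_for_status())        # prints None: a 200 response raises nothing

bad = requests.get("https://httpbin.org/status/404", timeout=30)
try:
    bad.raise_for_status()          # raises requests.exceptions.HTTPError for 4XX/5XX
except requests.exceptions.HTTPError as error:
    print(f"Request failed: {error}")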

In your code, there's no real point in for link in soup.find_all("a", href=re.compile("(?<=/url\?q=)(htt.*://.*)")): as the re.compile call could be replaced with a proper CSS selector, which might also improve parsing speed since there's no need to run a regex.
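
For example, a CSS attribute selector can match the same links directly. This is only a sketch, and it assumes the result links still carry the /url?q= prefix your regex was targeting:

from bs4 import BeautifulSoup

# `page` is the response you already fetched in your loop
soup = BeautifulSoup(page.content, "lxml")

clean_links = []
for link in soup.select('a[href^="/url?q="]'):              # attribute selector instead of a regex
    clean_links.append(link["href"].replace("/url?q=", ""))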

Also, you were paginating by using the loop variable as the value of the start URL parameter. I'll show you another way to scrape Google search results with pagination. This method uses the same start URL parameter, which is 0 by default: 0 means the first page, 10 the second, and so on. Or you can use SerpApi pagination for Google Search results, which is an API.

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "inurl:gedt.html",
    "hl": "en",         # language
    "gl": "us",         # country of the search, US -> USA
    "start": 0,         # number page by default up to 0
    "filter": 0         # shows more pages. By default filter = 1.
}

Also, a default search returns only a few pages of results. To increase the number of returned pages, you need to set the filter parameter to 0 and pass it in the URL. Basically, this parameter controls the filters for Similar Results and Omitted Results.
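
If you want to see how these parameters end up in the final URL, you can build the request without sending it and inspect it, for example:

import requests

params = {
    "q": "inurl:gedt.html",
    "hl": "en",
    "gl": "us",
    "start": 0,
    "filter": 0
}

# Prepare the request without sending it, just to inspect the resulting URL
prepared = requests.Request("GET", "https://www.google.com/search", params=params).prepare()
print(prepared.url)
# e.g. https://www.google.com/search?q=inurl%3Agedt.html&hl=en&gl=us&start=0&filter=0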

You don't have to save the entire page and then look for links by manipulating the raw string. You can get links from the search results in a simpler way.

links = []

for result in soup.select(".tF2Cxc a"):
    links.append(result["href"])

Note: Google periodically changes its selectors.

While the next-page button exists, you need to increment the ["start"] parameter value by 10 to access the next page; otherwise, break out of the while loop:

if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break

Code and full example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "inurl:gedt.html",
    "hl": "en",         # language
    "gl": "us",         # country of the search, US -> USA
    "start": 0,         # number page by default up to 0
    "filter": 0         # shows more pages. By default filter = 1.
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

links = []

while True: 
    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    if html.status_code != 200:
        break  # stop on a blocked or failed request instead of retrying forever

    for result in soup.select(".tF2Cxc a"):
        links.append(result["href"])

    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += 10
    else:
        break

for link in links:
    print(link)

Output:

https://www.triton.edu/GE_Certificates/EngineeringTechnologyWeldingCertificate/15.0614-Gedt.html
https://www.triton.edu/GE_Certificates/FacilitiesEngineeringTechnologyCertificate/46.0000-Gedt.html
https://www.triton.edu/GE_Certificates/EngineeringTechnologyDesignCertificate/15.1306-Gedt.html
https://www.triton.edu/GE_Certificates/EngineeringTechnologyFabricationCertificate/15.0499-Gedt.html
https://www.triton.edu/GE_Certificates/BusinessManagementCertificate/52.0201-Gedt.html
https://www.triton.edu/GE_Certificates/GeographicInformationSystemsCertificate/11.0202-Gedt.html
https://www.triton.edu/GE_Certificates/AutomotiveBrakeandSuspensionCertificate/47.0604-Gedt.html
https://www.triton.edu/GE_Certificates/EyeCareAssistantCertificate/51.1803-Gedt.html
https://www.triton.edu/GE_Certificates/InfantToddlerCareCertificate/19.0709-Gedt.html
https://www.triton.edu/GE_Certificates/WebTechnologiesCertificate/11.0801-Gedt.html
... other links

Or you can use SerpApi pagination for Google Search results, which is an API. Below is a short code snippet that paginates through all pages and extracts the links.

from serpapi import GoogleSearch
import os

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    "api_key": os.getenv("API_KEY"),    # your serpapi api key
    "engine": "google",                 # search engine
    "q": "inurl:gedt.html",             # search query
    "location": "Dallas",               # your location
    # other parameters
}

search = GoogleSearch(params)           # where data extraction happens on the SerpApi backend
pages = search.pagination()             # JSON -> Python dict

links = []

for page in pages:
    for result in page["organic_results"]:
        link = result["link"]
        links.append(link)
        print(link)

The output will be the same.

Disclaimer, I work for SerpApi.
