How to scale up scraping of search results (currently using requests and bs4 in Python)

Question

I wrote some Python code using requests to try to build a database of search result links:

from bs4 import BeautifulSoup
import requests
import re

for i in range(0, 1000, 20):
    url = "https://www.google.com/search?q=inurl%3Agedt.html&ie=utf-8&start=" + i.__str__() + "0&num=20"
    page = requests.get(url)
    if i == 0:
        soup = BeautifulSoup(page.content)
    else:
        soup.append(BeautifulSoup(page.content))

links = soup.findAll("a")

clean_links = []
for link in soup.find_all("a",href=re.compile("(?<=/url\?q=)(htt.*://.*)")):
    print(re.split(":(?=http)",link["href"].replace("/url?q=","")))
    clean_links.append(re.split(":(?=http)", link["href"].replace("/url?q=", "")))

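However, after only 40 results, Google suspects I am a robot and stops serving results. That is their prerogative, but is there a (legal) way around this?

Can I do some kind of authentication in requests / bs4? If so, is there some kind of account that lets me pay them for the privilege of retrieving all 10-20,000 results?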

Answer 1

There are several steps to bypass blocking:

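Make sure you send a user-agent request header so the request looks like a "real" user visit. The default requests user-agent is python-requests, so websites can tell that a script is most likely sending the request. Check what your user-agent is. Sending a browser user-agent is more reliable, but only up to a point.
A single user-agent is not enough, but you can rotate several of them to make it more reliable.
Sometimes passing only a user-agent is not enough. You can pass additional headers. See the other HTTP request headers that you can send when making a request.
The most reliable way to bypass blocking is residential proxies. Residential proxies let you choose a specific location (country, city, or mobile carrier) and browse the web as a real user in that area. A proxy can be defined as an intermediary that shields the user from general web traffic. It acts as a buffer while also hiding your IP address.
Proxies that are not overused are the best choice. You can scrape a number of public proxies and store them in a list(), or save them to a .txt file to save memory, then iterate over them when making requests and check the results. If the results are not what you wanted, move on to a different type of proxy.
You can get whitelisted. Being whitelisted means your IP address is added to the website's allow list, which explicitly grants certain identified entities access to a particular privilege; that is, a list of things that are allowed when everything is denied by default. One way to get whitelisted is to regularly do something useful for "them" based on the scraped data, which could lead to some insights.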
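
As a minimal sketch of the first few steps above (rotating the user-agent, sending an extra header, and cycling through proxies), the helper below builds the keyword arguments for a single requests.get call. The Firefox user-agent string, the Accept-Language value, the proxy addresses (reserved documentation IPs, not real servers), and the name build_request_kwargs are placeholders chosen for illustration.

```python
import itertools
import random

# Placeholder user-agent strings; substitute current ones from real browsers.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0",
]

# Hypothetical proxy addresses (reserved documentation IPs), not real servers.
PROXIES = ["http://203.0.113.10:8080", "http://203.0.113.11:8080"]
proxy_cycle = itertools.cycle(PROXIES)


def build_request_kwargs():
    """Return headers, proxies, and timeout for one requests.get(...) call."""
    proxy = next(proxy_cycle)  # round-robin so no single proxy is overused
    return {
        "headers": {
            "User-Agent": random.choice(USER_AGENTS),  # rotate the user-agent
            "Accept-Language": "en-US,en;q=0.9",       # one example of an extra header
        },
        "proxies": {"http": proxy, "https": proxy},
        "timeout": 30,
    }
```

It could then be used as requests.get("https://www.google.com/search", params=params, **build_request_kwargs()). None of this guarantees that requests will not be blocked.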

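For more on how to bypass blocking, you can read the blog post on reducing the chance of being blocked while web scraping.

You can also check the response with status_code. If a bad request was made (a 4XX client error or a 5XX server error response), it can be raised with Response.raise_for_status(). But if the status code of the request is 200 and we call raise_for_status(), we get None. That means there were no errors and everything is fine.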

html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

if html.status_code == 200:
    ...  # the rest of the code
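
To make the raise_for_status() behaviour concrete, here is a small sketch. The Response objects are built by hand so it runs without network access, the helper name describe is made up for this example, and 429 is used only as one example of a 4XX status.

```python
import requests


def describe(response):
    """Return "ok" for a successful response, or an error message otherwise."""
    try:
        # raise_for_status() returns None on success and raises HTTPError on 4XX/5XX.
        response.raise_for_status()
    except requests.exceptions.HTTPError as error:
        return f"blocked or failed: {error}"
    return "ok"


# Hand-built responses standing in for real ones (no request is sent).
ok = requests.models.Response()
ok.status_code = 200

too_many = requests.models.Response()
too_many.status_code = 429

print(describe(ok))
print(describe(too_many))
```

In a real scraper, the except branch is a natural place to stop, wait, or switch proxy instead of parsing a page that contains no results.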

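In your code, there is no real need for this line: for link in soup.find_all("a",href=re.compile("(?<=/url\?q=)(htt.*://.*)")): because re.compile can be replaced with an appropriate selector, which may also speed up parsing since no regular expression has to be executed.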
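
One possible way to do that, sketched under assumptions: a CSS attribute selector picks the anchors whose href starts with /url?q=, and urllib.parse pulls the target out of the q parameter. The sample markup is made up to imitate the redirect links from the question; real result pages may look different.

```python
from urllib.parse import parse_qs, urlparse

from bs4 import BeautifulSoup

# Made-up markup imitating the "/url?q=..." redirect links from the question.
SAMPLE_HTML = """
<a href="/url?q=https://example.com/a-gedt.html&amp;sa=U">first</a>
<a href="/search?q=something">not a result</a>
<a href="/url?q=https://example.org/b-gedt.html&amp;sa=U">second</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

clean_links = []
# [href^="/url?q="] matches anchors whose href starts with "/url?q=".
for anchor in soup.select('a[href^="/url?q="]'):
    query = parse_qs(urlparse(anchor["href"]).query)
    clean_links.append(query["q"][0])  # the target URL sits in the q parameter

print(clean_links)
# ['https://example.com/a-gedt.html', 'https://example.org/b-gedt.html']
```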

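Also, you use the loop variable as the value of the start URL parameter for pagination. I will show you another way to scrape Google search results with pagination. This method uses the same start URL parameter, which defaults to 0. 0 means the first page, 10 means the second page, and so on. Alternatively, you can use SerpApi pagination for Google search results, which is an API.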

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "inurl:gedt.html",
    "hl": "en",         # language
    "gl": "us",         # country of the search, US -> USA
    "start": 0,         # result offset: 0 is the first page, 10 the second, and so on
    "filter": 0         # shows more pages. By default filter = 1.
}

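Also, by default the search results return several pages. To increase the number of returned pages, set the filter parameter to 0 and pass it in the URL; more pages will then be returned. Basically, this parameter controls the filters for Similar Results and Omitted Results.

You do not have to save the whole page and then find links by manipulating strings. You can get the links from the search results in a simpler way.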
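
To see what URL these parameters turn into, the sketch below encodes them with the standard library's urlencode, which is roughly what requests does with the params argument. The helper name page_url is made up for this example.

```python
from urllib.parse import urlencode

params = {
    "q": "inurl:gedt.html",
    "hl": "en",
    "gl": "us",
    "start": 0,
    "filter": 0,
}


def page_url(start):
    """Build the search URL for the page that begins at result offset `start`."""
    return "https://www.google.com/search?" + urlencode({**params, "start": start})


print(page_url(0))
# https://www.google.com/search?q=inurl%3Agedt.html&hl=en&gl=us&start=0&filter=0
print(page_url(10))
# https://www.google.com/search?q=inurl%3Agedt.html&hl=en&gl=us&start=10&filter=0
```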

links = []

for result in soup.select(".tF2Cxc a"):
    links.append(result["href"])

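Note: Google periodically changes its selectors.

While the next button exists, you need to increase the ["start"] parameter value by 10 to access the next page (if it exists); otherwise we need to break out of the while loop: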

if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break
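
The same stepping logic can be sketched in isolation from the network. In the sketch below, fetch_page is a hypothetical callable that returns the links of one page plus a flag saying whether a next page exists; the page cap and the delay are additions for safety that are not part of the code in this answer.

```python
import time


def collect_links(fetch_page, max_pages=50, delay=0.0):
    """Follow pages until there is no next page or max_pages is reached.

    fetch_page(start) is a hypothetical callable returning (links, has_next).
    """
    links, start = [], 0
    for _ in range(max_pages):      # hard cap so the loop always terminates
        page_links, has_next = fetch_page(start)
        links.extend(page_links)
        if not has_next:
            break
        start += 10                 # same step as params["start"] += 10
        time.sleep(delay)           # optional pause between requests
    return links


# Fake three-page result set standing in for real responses.
FAKE_PAGES = {0: ["a", "b"], 10: ["c"], 20: ["d"]}


def fake_fetch(start):
    return FAKE_PAGES[start], (start + 10) in FAKE_PAGES


print(collect_links(fake_fetch))
# ['a', 'b', 'c', 'd']
```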

Code and full example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "inurl:gedt.html",
    "hl": "en",         # language
    "gl": "us",         # country of the search, US -> USA
    "start": 0,         # result offset: 0 is the first page, 10 the second, and so on
    "filter": 0         # shows more pages. By default filter = 1.
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

links = []

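while True:
    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)

    # Stop on any non-200 response; otherwise the same page would be requested forever.
    if html.status_code != 200:
        break

    soup = BeautifulSoup(html.text, "lxml")

    # Collect the result links on the current page.
    for result in soup.select(".tF2Cxc a"):
        links.append(result["href"])

    # Move to the next page while the "next" button exists.
    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += 10
    else:
        break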

for link in links:
    print(link)

Output:

https://www.triton.edu/GE_Certificates/EngineeringTechnologyWeldingCertificate/15.0614-Gedt.html
https://www.triton.edu/GE_Certificates/FacilitiesEngineeringTechnologyCertificate/46.0000-Gedt.html
https://www.triton.edu/GE_Certificates/EngineeringTechnologyDesignCertificate/15.1306-Gedt.html
https://www.triton.edu/GE_Certificates/EngineeringTechnologyFabricationCertificate/15.0499-Gedt.html
https://www.triton.edu/GE_Certificates/BusinessManagementCertificate/52.0201-Gedt.html
https://www.triton.edu/GE_Certificates/GeographicInformationSystemsCertificate/11.0202-Gedt.html
https://www.triton.edu/GE_Certificates/AutomotiveBrakeandSuspensionCertificate/47.0604-Gedt.html
https://www.triton.edu/GE_Certificates/EyeCareAssistantCertificate/51.1803-Gedt.html
https://www.triton.edu/GE_Certificates/InfantToddlerCareCertificate/19.0709-Gedt.html
https://www.triton.edu/GE_Certificates/WebTechnologiesCertificate/11.0801-Gedt.html
... other links

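Alternatively, you can use SerpApi pagination for Google search results, which is an API. Below is a short code snippet that paginates through all pages and extracts the links.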

from serpapi import GoogleSearch
import os

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    "api_key": os.getenv("API_KEY"),    # your serpapi api key
    "engine": "google",                 # search engine
    "q": "inurl:gedt.html",             # search query
    "location": "Dallas",               # your location
    # other parameters
}

search = GoogleSearch(params)           # where data extraction happens on the SerpApi backend
pages = search.pagination()             # iterable of result pages, each a Python dict

links = []

for page in pages:
    for result in page["organic_results"]:
        link = result["link"]
        links.append(link)
        print(link)

The output will be the same.

Disclaimer: I work for SerpApi.

Solution 1
1 Accepted 2022-08-16 20:57:39
