How to scale up scraping of search results (currently using requests and bs4 in Python)

Question

I wrote some Python code using requests to try to build a database of search result links:

from bs4 import BeautifulSoup
import requests
import re

for i in range(0, 1000, 20):
    # start is the result offset, num is the number of results per page
    url = "https://www.google.com/search?q=inurl%3Agedt.html&ie=utf-8&start=" + str(i) + "&num=20"
    page = requests.get(url)
    if i == 0:
        soup = BeautifulSoup(page.content, "html.parser")
    else:
        soup.append(BeautifulSoup(page.content, "html.parser"))

clean_links = []
# raw string so that the regex escape \? is passed through unchanged
for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    cleaned = re.split(":(?=http)", link["href"].replace("/url?q=", ""))
    print(cleaned)
    clean_links.append(cleaned)

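However, after only 40 results, Google suspects that I am a robot and stops serving results. That is their prerogative, but is there a (legitimate) way around this?

Can I do some kind of authentication in requests / bs4? If so, is there some kind of account that lets me pay them for the privilege of scraping all 10-20,000 results?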

Answer 1

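There are several steps to bypass blocking:

Make sure you send a user-agent request header so that the request looks like a visit from a "real" user. The default requests user-agent is python-requests, so websites can tell that the request most likely comes from a script. Check what your user-agent is. Sending a user-agent is more reliable, but only up to a point.
A single user-agent is not enough, but you can rotate several of them to make it more reliable.
Sometimes passing only a user-agent is not enough. You can pass additional headers. See the other HTTP request headers that you can send when making a request.
The most reliable way to bypass blocking is residential proxies. Residential proxies let you choose a specific location (country, city, or mobile carrier) and browse the web as a real user in that area. A proxy can be defined as an intermediary that shields users from general web traffic. It acts as a buffer and also hides your IP address.
Using proxies that are not overused is the best option. You can scrape a lot of public proxies and store them in a list(), or save them to a .txt file to save memory, then iterate over them while making requests to see what the results are, and move on to a different type of proxy if the results are not what you wanted.
You can get whitelisted. Being whitelisted means that your IP address is added to a website's allow list, which explicitly permits certain identified entities to access a particular privilege; in other words, it is a list of things that are allowed when everything is denied by default. One way to get whitelisted is to regularly do something useful for "them" based on the scraped data, which could lead to some insights.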
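
The rotation ideas from the steps above can be sketched roughly as follows. This is only an illustration under assumptions: the user-agent strings are examples that may be outdated, the proxy addresses are placeholders from a documentation-only IP range, and build_request_kwargs is a hypothetical helper name, not part of requests.

```python
import itertools
import random

# Example browser user-agent strings (assumed values, replace with current ones).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0",
]

# Placeholder proxies (203.0.113.0/24 is reserved for documentation, these are not real proxies).
PROXIES = ["http://203.0.113.10:8080", "http://203.0.113.11:8080"]
proxy_cycle = itertools.cycle(PROXIES)


def build_request_kwargs(rng=random):
    """Return keyword arguments for requests.get with a rotated user-agent and proxy."""
    proxy = next(proxy_cycle)  # round-robin over the proxy pool
    return {
        "headers": {
            "User-Agent": rng.choice(USER_AGENTS),  # random user-agent per request
            "Accept-Language": "en-US,en;q=0.9",    # one example of an additional header
        },
        "proxies": {"http": proxy, "https": proxy},
        "timeout": 30,
    }


# Usage (needs network access and the requests package):
#   import requests
#   response = requests.get("https://www.google.com/search",
#                           params={"q": "inurl:gedt.html"}, **build_request_kwargs())
```

Keeping the rotation in one helper means the scraping loop itself does not change when the pools of user-agents or proxies are replaced.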

For more on how to bypass blocking, you can read the "Reducing the chance of being blocked while web scraping" blog post.

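You can also check the response using status_code. If a bad request was made (a 4XX client error or a 5XX server error response), it can be raised with Response.raise_for_status(). But if the status code of the request is 200 and we call raise_for_status(), we get None, which means that there are no errors and everything is fine.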

html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

if html.status_code == 200:
    ...  # the rest of the code
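
To make the raise_for_status() behaviour described above concrete, here is a small offline sketch. It builds Response objects by hand instead of contacting a server, which is an assumption made purely for demonstration, and is_ok is a hypothetical helper name.

```python
import requests


def is_ok(response):
    """Return True for a successful response, False for a 4XX/5XX one."""
    try:
        response.raise_for_status()  # returns None when there is no HTTP error
    except requests.exceptions.HTTPError:
        return False
    return True


# Offline demonstration: set the status codes manually.
ok = requests.Response()
ok.status_code = 200

blocked = requests.Response()
blocked.status_code = 429  # "Too Many Requests", a status commonly used for rate limiting

print(is_ok(ok), is_ok(blocked))
```

In a real scraper the same check would be applied to the object returned by requests.get.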

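In your code, there is little point in: for link in soup.find_all("a",href=re.compile("(?<=/url\?q=)(htt.*://.*)")): because re.compile can be replaced with an appropriate selector, which may also speed up parsing since no regular expression has to be executed.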
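
As one possible illustration of that point, the sketch below swaps the regular expression for a CSS attribute selector and lets urllib.parse pull out the target URL. The sample HTML is invented for the example and only mimics the /url?q= redirect links from the question; extract_links is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlparse

from bs4 import BeautifulSoup

# Invented sample markup that imitates redirect-style result links.
SAMPLE_HTML = """
<a href="/url?q=https://example.com/a-Gedt.html&amp;sa=U">A</a>
<a href="/search?q=other">internal link</a>
<a href="/url?q=https://example.org/b-Gedt.html&amp;sa=U">B</a>
"""


def extract_links(html):
    """Collect the q= target of every anchor whose href starts with /url?q=."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.select('a[href^="/url?q="]'):  # attribute prefix selector, no regex
        query = parse_qs(urlparse(anchor["href"]).query)
        links.append(query["q"][0])
    return links


print(extract_links(SAMPLE_HTML))
```

Parsing the query string also drops trailing parameters such as sa=U, so no extra string splitting is needed.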

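Also, you use the loop variable as the value of the start URL parameter for pagination. I will show you another way to scrape Google search results with pagination. This method uses the same start URL parameter, which defaults to 0. 0 means the first page, 10 means the second page, and so on. Alternatively, you can use SerpApi pagination for Google search results, which is an API.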

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "inurl:gedt.html",
    "hl": "en",         # language
    "gl": "us",         # country of the search, US -> USA
    "start": 0,         # result offset, 0 (the default) is the first page
    "filter": 0         # shows more pages. By default filter = 1.
}

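Also, by default the search results span only a few pages. To increase the number of pages returned, set the filter parameter to 0 and pass it in the URL. Basically, this parameter controls the filters for Similar Results and Omitted Results.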
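
To show how these parameters end up in the request URL, here is a rough sketch that uses the standard library's urlencode to approximate what requests does with the params dictionary. The page_url helper and the page size of 10 are assumptions made for the illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.google.com/search"


def page_url(page_index, results_per_page=10):
    """Build the search URL for a zero-based page index."""
    params = {
        "q": "inurl:gedt.html",
        "hl": "en",
        "gl": "us",
        "start": page_index * results_per_page,  # 0 -> first page, 10 -> second page
        "filter": 0,                              # disable the similar/omitted results filter
    }
    return BASE_URL + "?" + urlencode(params)


for page_index in range(3):
    print(page_url(page_index))
```

Passing the dictionary through params= in requests.get avoids building the string by hand and handles the escaping of characters such as the colon in the query.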

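You do not have to save the whole page and then find the links by manipulating strings. You can get the links from the search results in a simpler way.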

links = []

for result in soup.select(".tF2Cxc a"):
    links.append(result["href"])

Note: Google changes its selectors periodically.

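While the next button exists, you need to increase the ["start"] parameter value by 10 to access the next page (if it is present); otherwise we need to break out of the while loop: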

if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break

Full code example:

from bs4 import BeautifulSoup
import requests, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "inurl:gedt.html",
    "hl": "en",         # language
    "gl": "us",         # country of the search, US -> USA
    "start": 0,         # result offset, 0 (the default) is the first page
    "filter": 0         # shows more pages. By default filter = 1.
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

links = []

while True:
    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)

    # stop on a non-200 response (for example a block page) instead of looping forever
    if html.status_code != 200:
        break

    soup = BeautifulSoup(html.text, "lxml")

    for result in soup.select(".tF2Cxc a"):
        links.append(result["href"])

    # follow the "next" button while it exists
    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += 10
    else:
        break

for link in links:
    print(link)

Output:

https://www.triton.edu/GE_Certificates/EngineeringTechnologyWeldingCertificate/15.0614-Gedt.html
https://www.triton.edu/GE_Certificates/FacilitiesEngineeringTechnologyCertificate/46.0000-Gedt.html
https://www.triton.edu/GE_Certificates/EngineeringTechnologyDesignCertificate/15.1306-Gedt.html
https://www.triton.edu/GE_Certificates/EngineeringTechnologyFabricationCertificate/15.0499-Gedt.html
https://www.triton.edu/GE_Certificates/BusinessManagementCertificate/52.0201-Gedt.html
https://www.triton.edu/GE_Certificates/GeographicInformationSystemsCertificate/11.0202-Gedt.html
https://www.triton.edu/GE_Certificates/AutomotiveBrakeandSuspensionCertificate/47.0604-Gedt.html
https://www.triton.edu/GE_Certificates/EyeCareAssistantCertificate/51.1803-Gedt.html
https://www.triton.edu/GE_Certificates/InfantToddlerCareCertificate/19.0709-Gedt.html
https://www.triton.edu/GE_Certificates/WebTechnologiesCertificate/11.0801-Gedt.html
... other links
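
The while True loop in the full example keeps requesting pages for as long as a next button is found. A more cautious variant, sketched below under assumptions, caps the number of pages and pauses between requests. The fetch_page callable, the max_pages and delay_seconds parameters, the simplified a#pnnext selector, and the fake pages are all hypothetical and exist only so the sketch can run without network access.

```python
import time

from bs4 import BeautifulSoup


def collect_links(fetch_page, max_pages=5, delay_seconds=0.0):
    """Collect result links page by page, stopping at max_pages or when there is no next button."""
    links = []
    start = 0
    for _ in range(max_pages):
        html = fetch_page(start)  # returns the HTML text, or None on failure
        if html is None:
            break
        soup = BeautifulSoup(html, "html.parser")
        links.extend(a["href"] for a in soup.select(".tF2Cxc a[href]"))
        if soup.select_one("a#pnnext") is None:  # no next button: last page reached
            break
        start += 10
        time.sleep(delay_seconds)  # be polite between requests
    return links


# Fake pages keyed by the start offset, standing in for real responses.
FAKE_PAGES = {
    0: '<div class="tF2Cxc"><a href="https://example.com/1">1</a></div>'
       '<a id="pnnext" href="#">Next</a>',
    10: '<div class="tF2Cxc"><a href="https://example.com/2">2</a></div>',
}

print(collect_links(FAKE_PAGES.get))
```

In real use, fetch_page would wrap requests.get with the params and headers shown earlier and return None for a non-200 status.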

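Alternatively, you can use SerpApi pagination for Google search results, which is an API. Below is a short code snippet that paginates through all pages and extracts the links.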

from serpapi import GoogleSearch
import os

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    "api_key": os.getenv("API_KEY"),    # your serpapi api key
    "engine": "google",                 # search engine
    "q": "inurl:gedt.html",             # search query
    "location": "Dallas",               # your location
    # other parameters
}

search = GoogleSearch(params)           # where data extraction happens on the SerpApi backend
pages = search.pagination()             # JSON -> Python dict

links = []

for page in pages:
    for result in page["organic_results"]:
        link = result["link"]
        links.append(link)
        print(link)

The output will be the same.

Disclaimer: I work for SerpApi.

Solution 1
1 Accepted 2022-08-16 20:57:39
