
How to scale up scraping of search results (currently using requests and bs4 in Python)

I wrote some Python code using requests to try to build a database of search result links:

from bs4 import BeautifulSoup
import requests
import re

for i in range(0, 1000, 20):
    url = "https://www.google.com/search?q=inurl%3Agedt.html&ie=utf-8&start=" + i.__str__() + "0&num=20"
    page = requests.get(url)
    if i == 0:
        soup = BeautifulSoup(page.content)
    else:
        soup.append(BeautifulSoup(page.content))

links = soup.findAll("a")

clean_links = []
for link in soup.find_all("a",href=re.compile("(?<=/url\?q=)(htt.*://.*)")):
    print(re.split(":(?=http)",link["href"].replace("/url?q=","")))
    clean_links.append(re.split(":(?=http)", link["href"].replace("/url?q=", "")))

However, after only 40 results Google suspected me of being a robot and quit providing results. That's their prerogative, but is there a (legitimate) way of getting around this?

Can I have some sort of authentication in requests / bs4 and, if so, is there some kind of account that lets me pay them for the privilege of scraping all 10-20,000 results?

There are several steps you can take to bypass blocking:

  1. Make sure you're sending a user-agent request header so the request looks like a "real" user visit. The default requests user-agent is python-requests, and websites understand that such a request is most likely coming from a script. Check what your user-agent is. Using a realistic User-Agent is more reliable, but only up to a point.
  2. Having one user-agent is not enough; you can rotate through several of them to make requests a bit more reliable (see the sketch after this list).
  3. Sometimes passing only a user-agent isn't enough. You can pass additional headers. See the other HTTP request headers you can send while making a request.
  4. The most reliable way to bypass blocking is residential proxies. Residential proxies allow you to choose a specific location (country, city, or mobile carrier) and surf the web as a real user in that area. Proxies act as intermediaries between you and general web traffic: they buffer the connection while also concealing your IP address.
  5. Using proxies that are not overused is the best option. You can scrape a lot of public proxies and save them to a list(), or to a .txt file to save memory, then iterate over them while making requests to see what the results are, and move on to a different type of proxy if the result is not what you were looking for.
  6. You can get whitelisted. Getting whitelisted means having your IP address added to a website's allow list, which explicitly grants certain identified entities access to a particular privilege, i.e. a list of things that are allowed when everything else is denied by default. One way to become whitelisted is to regularly do something useful for "them" based on the scraped data, which could lead to useful insights.
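
As a rough illustration of rotating user-agents (point 2) and routing requests through a proxy (points 4-5), here's a minimal sketch. The user-agent strings are just examples and the proxy address is a placeholder, not a working endpoint:

import random
import requests

# Example user-agent strings -- replace with your own, up-to-date pool
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Safari/605.1.15",
]

# Placeholder proxy -- substitute a residential or other proxy you actually have access to
proxies = {
    "http": "http://user:password@proxy.example.com:8080",
    "https": "http://user:password@proxy.example.com:8080",
}

headers = {"User-Agent": random.choice(user_agents)}  # pick a different one per request

response = requests.get(
    "https://www.google.com/search",
    params={"q": "inurl:gedt.html"},
    headers=headers,
    proxies=proxies,
    timeout=30,
)
print(response.status_code)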

For more information on how to bypass blocking, you can read the Reducing the chance of being blocked while web scraping blog post.

You can also check the response with status_code. If a bad request was made (client error 4XX or server error 5XX), it can be raised as an exception with Response.raise_for_status(). But if the status code of the request is 200 and we call raise_for_status(), we get None, which means there are no errors and everything is fine.

html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

if html.status_code == 200:
    pass  # the rest of the code goes here
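
If you prefer an exception over an if check, a small sketch using raise_for_status() could look like this (assuming params and headers are defined as in the full example below):

from bs4 import BeautifulSoup
import requests

try:
    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    html.raise_for_status()   # raises requests.HTTPError on 4XX/5XX, returns None on 200
except requests.HTTPError as error:
    print(f"Request failed: {error}")
else:
    soup = BeautifulSoup(html.text, "lxml")
    # the rest of the code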

In your code, there's no real point in for link in soup.find_all("a",href=re.compile("(?<=/url\?q=)(htt.*://.*)")): the re.compile call can be replaced with a proper CSS selector, which might also improve parsing speed since there's no need to run a regex.
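
For example, the same redirect links can be matched with a CSS attribute selector instead of a regex. This is only a rough equivalent of your loop, reusing the soup and clean_links from your snippet; the exact clean-up of the trailing Google parameters is up to you:

# <a> tags whose href starts with "/url?q="
for link in soup.select('a[href^="/url?q="]'):
    clean_link = link["href"].replace("/url?q=", "").split("&sa=")[0]
    clean_links.append(clean_link)
    print(clean_link)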

Also, you were paginating by using the loop variable as the value of the start URL parameter. I'll show you another way to scrape Google search results with pagination. This method uses the same start URL parameter, which is equal to 0 by default: 0 means the first page, 10 the second, and so on. Or you can use SerpApi pagination for Google Search results, which is an API.

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "inurl:gedt.html",
    "hl": "en",         # language
    "gl": "us",         # country of the search, US -> USA
    "start": 0,         # number page by default up to 0
    "filter": 0         # shows more pages. By default filter = 1.
}

Also, the default search results return only a few pages. To increase the number of returned pages, you need to set the filter parameter to 0 and pass it in the URL, which will return more pages. Basically, this parameter controls the filters for Similar Results and Omitted Results.

You don't have to save the entire page and then look for links by manipulating strings. You can get the links from the search results in a simpler way:

links = []

for result in soup.select(".tF2Cxc a"):
    links.append(result["href"])

Note: Google periodically changes its selectors.
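
One way to soften that (my own workaround, not something Google documents) is to try a small list of candidate selectors against the soup from the example below and use the first one that still matches anything:

# Hypothetical fallback chain -- these selectors are examples and may also go stale
candidate_selectors = [".tF2Cxc a", ".yuRUbf a", 'a[href^="/url?q="]']

results = []
for selector in candidate_selectors:
    results = soup.select(selector)
    if results:          # stop at the first selector that still matches
        break

for result in results:
    links.append(result["href"])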

While the next button exists, you need to increment the ["start"] parameter value by 10 to access the next page; otherwise, break out of the while loop:

if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break

Code and full example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "inurl:gedt.html",
    "hl": "en",         # language
    "gl": "us",         # country of the search, US -> USA
    "start": 0,         # number page by default up to 0
    "filter": 0         # shows more pages. By default filter = 1.
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

links = []

while True: 
    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    if html.status_code != 200:
        break  # stop if the request was blocked or failed

    for result in soup.select(".tF2Cxc a"):
        links.append(result["href"])

    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += 10
    else:
        break

for link in links:
    print(link)

Output:

https://www.triton.edu/GE_Certificates/EngineeringTechnologyWeldingCertificate/15.0614-Gedt.html
https://www.triton.edu/GE_Certificates/FacilitiesEngineeringTechnologyCertificate/46.0000-Gedt.html
https://www.triton.edu/GE_Certificates/EngineeringTechnologyDesignCertificate/15.1306-Gedt.html
https://www.triton.edu/GE_Certificates/EngineeringTechnologyFabricationCertificate/15.0499-Gedt.html
https://www.triton.edu/GE_Certificates/BusinessManagementCertificate/52.0201-Gedt.html
https://www.triton.edu/GE_Certificates/GeographicInformationSystemsCertificate/11.0202-Gedt.html
https://www.triton.edu/GE_Certificates/AutomotiveBrakeandSuspensionCertificate/47.0604-Gedt.html
https://www.triton.edu/GE_Certificates/EyeCareAssistantCertificate/51.1803-Gedt.html
https://www.triton.edu/GE_Certificates/InfantToddlerCareCertificate/19.0709-Gedt.html
https://www.triton.edu/GE_Certificates/WebTechnologiesCertificate/11.0801-Gedt.html
... other links
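
Since the original goal was to build a database of search result links, a minimal way to persist what was collected (my addition, not part of the scraping logic above) is to write the links list to a CSV file:

import csv

# Save the collected links to a one-column CSV file
with open("gedt_links.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["link"])                # header row
    writer.writerows([link] for link in links)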

Or you can use SerpApi pagination for Google Search results, which is an API. Below is a short code snippet that paginates through all pages and extracts the links.

from serpapi import GoogleSearch
import os

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    "api_key": os.getenv("API_KEY"),    # your serpapi api key
    "engine": "google",                 # search engine
    "q": "inurl:gedt.html",             # search query
    "location": "Dallas",               # your location
    # other parameters
}

search = GoogleSearch(params)           # where data extraction happens on the SerpApi backend
pages = search.pagination()             # JSON -> Python dict

links = []

for page in pages:
    for result in page["organic_results"]:
        link = result["link"]
        links.append(link)
        print(link)

The output will be the same.

Disclaimer: I work for SerpApi.
