
How to print the number of Google search results (BeautifulSoup)

This is what I've done so far:

import requests
from bs4 import BeautifulSoup

URL = "https://www.google.com/search?q=programming"
r = requests.get(URL) 

soup = BeautifulSoup(r.content, 'html5lib')

table = soup.find('div', attrs = {'id':'result-stats'}) 

print(table)

I want to get the number of results as an integer, for example 1350000000.

You are missing the User-Agent header, a string that tells the server what kind of client or device you are accessing the page with.

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"}
URL     = "https://www.google.com/search?q=programming"
result = requests.get(URL, headers=headers)    

soup = BeautifulSoup(result.content, 'html.parser')

# Outer text of the div only, e.g. 'About 1,410,000,000 results'
total_results_text = soup.find("div", {"id": "result-stats"}).find(string=True, recursive=False)
# Drop every character that is not a digit, then convert to an integer
results_num = int(''.join(ch for ch in total_results_text if ch.isdigit()))
print(results_num)
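If the counter text ever carries more than one number (for instance when the "(0.38 seconds)" part is included), joining every digit would glue them together. A minimal sketch of a stricter helper that takes only the first number; `parse_result_count` is a hypothetical name, and the sample string is illustrative rather than captured from Google:

```python
import re
from typing import Optional


def parse_result_count(text: str) -> Optional[int]:
    """Return the first number in a result-stats string as an int, or None."""
    # First run of digits, allowing common group separators between them
    # (comma, dot, no-break space, regular space).
    match = re.search(r"\d[\d,.\u00a0 ]*", text)
    if match is None:
        return None
    # Strip the separators and convert what is left.
    return int(re.sub(r"\D", "", match.group()))


print(parse_result_count("About 1,410,000,000 results"))  # 1410000000
```

It returns None instead of raising when the string contains no digits, which makes the "no counter on the page" case easier to handle.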

This code will do the trick:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"}
result = requests.get("https://www.google.com/search?q=programming", headers=headers)

src = result.content
soup = BeautifulSoup(src, 'lxml')

print(soup.find("div", {"id": "result-stats"}))
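Note that `find()` returns None when the element is not in the response (for example, if Google serves a consent or block page instead of results), so it may be worth guarding against that before reading the text. A small offline sketch; both HTML strings are hypothetical stand-ins for a response body, and `stats_text` is a made-up helper name:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a normal results page.
html_with_stats = "<html><body><div id='result-stats'>About 107,000 results</div></body></html>"
# Hypothetical stand-in for a page that lacks the counter.
html_without_stats = "<html><body><div id='main'>No counter here</div></body></html>"


def stats_text(html: str):
    """Return the text of #result-stats, or None if the element is absent."""
    soup = BeautifulSoup(html, "html.parser")
    stats = soup.find("div", {"id": "result-stats"})
    return stats.get_text(strip=True) if stats is not None else None


print(stats_text(html_with_stats))     # About 107,000 results
print(stats_text(html_without_stats))  # None
```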

If you need to extract just one element, use the select_one() bs4 method. It's a bit more readable and a bit faster than find() (see the CSS selectors reference).

If you need to extract data very fast, try selectolax, a wrapper around the lexbor HTML renderer library, which is written in pure C with no dependencies, and it's fast.

Code and example:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "fus ro dah definition",  # query
  "gl": "us",                    # country 
  "hl": "en"                     # language
}

response = requests.get('https://www.google.com/search',
                        headers=headers,
                        params=params)
soup = BeautifulSoup(response.text, 'lxml')

# .previous_sibling steps back to the text node before <nobr>, dropping the unwanted part: "(0.38 seconds)"
number_of_results = soup.select_one('#result-stats nobr').previous_sibling
print(number_of_results)

# About 107,000 results
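To see why `.previous_sibling` works here, and to get an integer instead of a string, the same selector can be tried on a static snippet. The markup below is an assumption inferred from the selector above, not a copy of Google's HTML:

```python
from bs4 import BeautifulSoup

# Assumed structure: outer text first, then <nobr> holding the timing.
html = '<div id="result-stats">About 107,000 results<nobr> (0.38 seconds)</nobr></div>'
soup = BeautifulSoup(html, "html.parser")

nobr = soup.select_one("#result-stats nobr")
outer_text = nobr.previous_sibling  # the text node right before <nobr>

# Keep only the digits of the outer text and convert them.
count = int("".join(ch for ch in outer_text if ch.isdigit()))

print(str(outer_text))  # About 107,000 results
print(count)            # 107000
```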

Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that the only thing you need to do is pick the data you want out of the structured JSON, rather than figuring out how to extract certain elements or how to bypass blocks from Google.

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "fus ro dah definition",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

result = results["search_information"]['total_results']
print(result)

# 107000
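If the response might not contain `search_information` (a failed or empty search, say), direct indexing would raise a KeyError. A hedged sketch of a tolerant lookup; `total_results` is a hypothetical helper, and both sample dictionaries are made up to mirror the keys used above rather than taken from a real response:

```python
from typing import Optional


def total_results(results: dict) -> Optional[int]:
    """Read search_information.total_results, or return None if missing."""
    info = results.get("search_information") or {}
    value = info.get("total_results")
    return int(value) if value is not None else None


# Made-up payloads for illustration only.
ok_payload = {"search_information": {"total_results": 107000}}
bad_payload = {"error": "..."}

print(total_results(ok_payload))   # 107000
print(total_results(bad_payload))  # None
```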

PS - I wrote a blog post about how to scrape Google Organic Results.

Disclaimer: I work for SerpApi.


 