
Scrape Google Search Result Description Using BeautifulSoup

I want to scrape the Google search result descriptions using BeautifulSoup, but I can't manage to scrape the tag that contains the description.

Ancestors:

html
body#gsr.srp.vasq.wf-b
div#main
div#cnt.big
div.mw
div#rcnt
div.col
div#center_col
div#res.med
div#search
div
div#rso
div.g
div.rc
div.IsZvec
div
span.aCOpRe

Children:

em
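
For reference, stitching that ancestor chain together gives roughly the CSS path below (a sketch assembled purely from the lists above; these generated class names change whenever Google updates its frontend):

# CSS path assembled from the ancestor/children lists above (sketch only)
selector = "div#rso div.g div.rc div.IsZvec div span.aCOpRe em"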

Python code:

from bs4 import BeautifulSoup
import requests
import re

search = input("Enter the search term: ")
param = {"q": search}

# The query goes in params, so the bare /search URL is enough
r = requests.get("https://google.com/search", params=param)

soup = BeautifulSoup(r.content, "lxml")

title = soup.findAll("div", class_="BNeawe vvjwJb AP7Wnd")

for t in title:
    print(t.get_text())

description = soup.findAll("span", class_="aCOpRe")

for d in description:
    print(d.get_text())

print("\n")

# Result links are wrapped as /url?q=<target>; unwrap them
for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    print(re.split(":(?=http)", link["href"].replace("/url?q=", "")))

[Image link showing the tags]

The correct CSS selector for the Google search result snippet (description) is .aCOpRe span:not(.f)

Here's a full example in an online IDE:

from bs4 import BeautifulSoup
import requests
import re

param = {"q": "coffee"}
# Without a real browser User-Agent, Google serves different markup
# with different class names, and the selectors below would not match
headers = {
    "User-Agent":
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Safari/605.1.15"
}

r = requests.get("https://google.com/search", params=param, headers=headers)

soup = BeautifulSoup(r.content, "lxml")

title = soup.select(".DKV0Md span")

for t in title:
    print(f"Title: {t.get_text()}\n")

snippets = soup.select(".aCOpRe span:not(.f)")

for d in snippets:
    print(f"Snippet: {d.get_text()}\n")

# Result links are wrapped as /url?q=<target>; unwrap them
for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    print(re.split(":(?=http)", link["href"].replace("/url?q=", "")))

Output:

Title: Coffee - Wikipedia

Title: Coffee: Benefits, nutrition, and risks - Medical News Today

...

Snippet: Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. When coffee berries turn from green to bright red in color – indicating ripeness – they are picked, processed, and dried.

Snippet: When people think of coffee, they usually think of its ability to provide an energy boost. ... This article looks at the health benefits of drinking coffee, the evidence ...

...

Alternatively, you can extract data from Google search results via SerpApi.

curl example:

curl -s 'https://serpapi.com/search?q=coffee&location=Sweden&google_domain=google.se&gl=se&hl=sv&num=100'
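
Since the endpoint returns plain JSON, the curl call above also translates directly into requests (a sketch; an api_key parameter, omitted from the curl line, is required in practice and shown here as a placeholder):

import requests

# Sketch: calling the SerpApi endpoint directly; "YOUR_API_KEY" is a placeholder
params = {
    "q": "coffee",
    "location": "Sweden",
    "google_domain": "google.se",
    "gl": "se",
    "hl": "sv",
    "num": 100,
    "api_key": "YOUR_API_KEY",
}
data = requests.get("https://serpapi.com/search", params=params).json()

for result in data["organic_results"]:
    print(result["title"], result["link"])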

Python example:

from serpapi import GoogleSearch
import os

params = {
    "engine": "google",
    "q": "coffee",
    "location": "Sweden",
    "google_domain": "google.se",
    "gl": "se",
    "hl": "sv",
    "num": 100,
    "api_key": os.getenv("API_KEY")
}

client = GoogleSearch(params)
data = client.get_dict()

print("Organic results")

for result in data['organic_results']:
    print(f"""
Title: {result['title']}
Link: {result['link']}
Position: {result['position']}
Snippet: {result['snippet']}
""")

Output:

Organic results

Title: Coffee - Wikipedia
Link: https://en.wikipedia.org/wiki/Coffee
Position: 1
Snippet: Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. When coffee berries turn from green to bright red ...


Title: Drop Coffee
Link: https://www.dropcoffee.com/
Position: 2
Snippet: Drop Coffee is an award winning roastery in Stockholm, representing Sweden four times in the World Coffee Roasting Championship, placing second, third and ...

...

Disclaimer: I work for SerpApi.

You might want to try using CSS selectors and then pulling the text out.

For example:

import requests
from bs4 import BeautifulSoup


page = requests.get("https://www.google.com/search?q=scrap").text
soup = BeautifulSoup(page, "html.parser").select(".s3v9rd.AP7Wnd")

for item in soup:
    print(item.getText(strip=True))

Sample output for scrap:

discard or remove from service (a redundant, old, or inoperative vehicle, vessel, or machine), especially so as to convert it to scrap metal.
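
Note that this request sends no browser User-Agent, so Google may serve markup where these class names differ. A minimal variant of the same approach, reusing the User-Agent header and snippet selector from the first answer (a sketch, with no guarantee against future markup changes):

import requests
from bs4 import BeautifulSoup

# Sketch: same approach, but identifying as a browser so Google serves
# the markup that the .aCOpRe snippet selector targets
headers = {
    "User-Agent":
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Safari/605.1.15"
}
page = requests.get("https://www.google.com/search",
                    params={"q": "scrap"}, headers=headers).text
soup = BeautifulSoup(page, "html.parser")

for item in soup.select(".aCOpRe span:not(.f)"):
    print(item.get_text(strip=True))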

Here is my solution: the code gets all the titles, links, breadcrumbs, and descriptions (but not the featured section, which someone asked about) of the Google search results you see when searching for something.

 query = "Your search term"
    driver_location = "C:\Program Files (x86)\chromedriver.exe"
    options = webdriver.ChromeOptions()
    options.add_argument('--lang=en,en_US')
    # options.add_argument('--disable-gpu')
    # options.add_argument('--no-sandbox')
  options.add_argument('Accept=text/html,application/xhtml+xml,application/xml;q=0.9,i

mage/webp')
# options.add_argument('Accept-Encoding= gzip')
# options.add_argument('Accept-Language= en-US,en;q=0.9,es;q=0.8')
# options.add_argument('Upgrade-Insecure-Requests: 1')
# options.add_argument('image/apng,*/*;q=0.8,application/signed-exchange;v=b3')
# options.add_argument('user-agent=' + ua['google chrome'])
# options.add_argument('proxy-server=' + "115.42.65.14:8080")
# options.add_argument('Referer=' + "https://www.google.com/")
driver = webdriver.Chrome(executable_path=driver_location,chrome_options=options)

driver.get("https://www.google.com/search?q={}&oq={}&hl=en&num=50".format(urllib.parse.quote(query),urllib.parse.quote(query)))
p = driver.find_elements_by_class_name("tF2Cxc")
titles = driver.find_elements_by_class_name("yuRUbf")
descriptions = driver.find_elements_by_class_name("IsZvec")
time.sleep(10)

link_list = []
description_list = []
featured = False
featured_links = 0
title_list = []
featured_max = 0
featured_num = 0

for index in range(len(p)):
    p_items = p[index].get_attribute("innerHTML")
    print(p_items)
    items_soup = BeautifulSoup(p_items,"html.parser")
    if(featured==False):
        if((len(items_soup.text.split("\n")) != 2)):
            print(items_soup.text.split("\n"))
            if ((items_soup.select(".IsZvec") != None) and 
                   (items_soup.select(".IsZvec")[0].text != "") and (items_soup.select(".IsZvec") != "")):
                a = items_soup.select("a",recursive=False)[0]["href"]
                print(a)
                link_list.append(a)
    title_list.append(titles[index].text)
    description_list.append(descriptions[index].text)
description_list_new = []
title_list_new = []
for index in range(len(description_list)):
    if (description_list[index] == ""):
        pass
    elif (re.findall(r'<\w{1,}\s\w{1,}>',description_list[index]) != []):
        pass
    else:
        description_list_new.append(description_list[index])
        title_list_new.append(title_list[index])
description_list = description_list_new
title_list = title_list_new

for title in range(len(title_list)):
    print(title_list[title])
    print(description_list[title])
    print("=======================")
print(link_list)
print(len(title_list))
print(len(link_list))
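
Note that the find_elements_by_* helpers and the executable_path / chrome_options arguments were removed in Selenium 4; on a current Selenium the setup and lookups would look roughly like this (a sketch of the equivalent calls, not part of the original answer):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 equivalents of the deprecated calls used above
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=Service(r"C:\Program Files (x86)\chromedriver.exe"),
                          options=options)
driver.get("https://www.google.com/search?q=coffee&hl=en&num=50")

p = driver.find_elements(By.CLASS_NAME, "tF2Cxc")
titles = driver.find_elements(By.CLASS_NAME, "yuRUbf")
descriptions = driver.find_elements(By.CLASS_NAME, "IsZvec")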
