
Scrape Google Search Result Description Using BeautifulSoup

I want to scrape the Google search result descriptions using BeautifulSoup, but I can't manage to scrape the tag that contains the description.

Ancestors:

html
body#gsr.srp.vasq.wf-b
div#main
div#cnt.big
div.mw
div#rcnt
div.col
div#center_col
div#res.med
div#search
div
div#rso
div.g
div.rc
div.IsZvec
div
span.aCOpRe

Children:

em
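
For reference, stitching that ancestor chain together gives roughly the CSS path below (a sketch assembled purely from the lists above; these generated class names change whenever Google updates its frontend):

# CSS path assembled from the ancestor/children lists above (sketch only)
selector = "div#rso div.g div.rc div.IsZvec div span.aCOpRe em"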

Python code:

from bs4 import BeautifulSoup
import requests
import re

search = input("Enter the search term: ")
param = {"q": search}

# The query goes in params, so the bare /search URL is enough
r = requests.get("https://google.com/search", params=param)

soup = BeautifulSoup(r.content, "lxml")

title = soup.findAll("div", class_="BNeawe vvjwJb AP7Wnd")

for t in title:
    print(t.get_text())

description = soup.findAll("span", class_="aCOpRe")

for d in description:
    print(d.get_text())

print("\n")

# Result links are wrapped as /url?q=<target>; unwrap them
for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    print(re.split(":(?=http)", link["href"].replace("/url?q=", "")))

[Image link showing the tags]

The correct CSS selector for the Google search result snippet (description) is .aCOpRe span:not(.f)

Here's a full example in an online IDE:

from bs4 import BeautifulSoup
import requests
import re

param = {"q": "coffee"}
# Without a real browser User-Agent, Google serves different markup
# with different class names, and the selectors below would not match
headers = {
    "User-Agent":
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Safari/605.1.15"
}

r = requests.get("https://google.com/search", params=param, headers=headers)

soup = BeautifulSoup(r.content, "lxml")

title = soup.select(".DKV0Md span")

for t in title:
    print(f"Title: {t.get_text()}\n")

snippets = soup.select(".aCOpRe span:not(.f)")

for d in snippets:
    print(f"Snippet: {d.get_text()}\n")

# Result links are wrapped as /url?q=<target>; unwrap them
for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    print(re.split(":(?=http)", link["href"].replace("/url?q=", "")))

Output:

Title: Coffee - Wikipedia

Title: Coffee: Benefits, nutrition, and risks - Medical News Today

...

Snippet: Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. When coffee berries turn from green to bright red in color – indicating ripeness – they are picked, processed, and dried.

Snippet: When people think of coffee, they usually think of its ability to provide an energy boost. ... This article looks at the health benefits of drinking coffee, the evidence ...

...

Alternatively, you can extract data from Google search results via SerpApi.

curl example:

curl -s 'https://serpapi.com/search?q=coffee&location=Sweden&google_domain=google.se&gl=se&hl=sv&num=100'
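
Since the endpoint returns plain JSON, the curl call above also translates directly into requests (a sketch; an api_key parameter, omitted from the curl line, is required in practice and shown here as a placeholder):

import requests

# Sketch: calling the SerpApi endpoint directly; "YOUR_API_KEY" is a placeholder
params = {
    "q": "coffee",
    "location": "Sweden",
    "google_domain": "google.se",
    "gl": "se",
    "hl": "sv",
    "num": 100,
    "api_key": "YOUR_API_KEY",
}
data = requests.get("https://serpapi.com/search", params=params).json()

for result in data["organic_results"]:
    print(result["title"], result["link"])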

Python example:

from serpapi import GoogleSearch
import os

params = {
    "engine": "google",
    "q": "coffee",
    "location": "Sweden",
    "google_domain": "google.se",
    "gl": "se",
    "hl": "sv",
    "num": 100,
    "api_key": os.getenv("API_KEY")
}

client = GoogleSearch(params)
data = client.get_dict()

print("Organic results")

for result in data['organic_results']:
    print(f"""
Title: {result['title']}
Link: {result['link']}
Position: {result['position']}
Snippet: {result['snippet']}
""")

Output:

Organic results

Title: Coffee - Wikipedia
Link: https://en.wikipedia.org/wiki/Coffee
Position: 1
Snippet: Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. When coffee berries turn from green to bright red ...


Title: Drop Coffee
Link: https://www.dropcoffee.com/
Position: 2
Snippet: Drop Coffee is an award winning roastery in Stockholm, representing Sweden four times in the World Coffee Roasting Championship, placing second, third and ...

...

Disclaimer: I work for SerpApi.

You might want to try using CSS selectors and then pulling the text out.

For example:

import requests
from bs4 import BeautifulSoup


page = requests.get("https://www.google.com/search?q=scrap").text
soup = BeautifulSoup(page, "html.parser").select(".s3v9rd.AP7Wnd")

for item in soup:
    print(item.getText(strip=True))

Sample output for scrap:

discard or remove from service (a redundant, old, or inoperative vehicle, vessel, or machine), especially so as to convert it to scrap metal.
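
Note that this request sends no browser User-Agent, so Google may serve markup where these class names differ. A minimal variant of the same approach, reusing the User-Agent header and snippet selector from the first answer (a sketch, with no guarantee against future markup changes):

import requests
from bs4 import BeautifulSoup

# Sketch: same approach, but identifying as a browser so Google serves
# the markup that the .aCOpRe snippet selector targets
headers = {
    "User-Agent":
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Safari/605.1.15"
}
page = requests.get("https://www.google.com/search",
                    params={"q": "scrap"}, headers=headers).text
soup = BeautifulSoup(page, "html.parser")

for item in soup.select(".aCOpRe span:not(.f)"):
    print(item.get_text(strip=True))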

Here is my solution: the code gets all the titles, links, breadcrumbs, and descriptions (but not the featured section, which someone asked about) of the Google search results you see when searching for something.

 query = "Your search term"
    driver_location = "C:\Program Files (x86)\chromedriver.exe"
    options = webdriver.ChromeOptions()
    options.add_argument('--lang=en,en_US')
    # options.add_argument('--disable-gpu')
    # options.add_argument('--no-sandbox')
  options.add_argument('Accept=text/html,application/xhtml+xml,application/xml;q=0.9,i

mage/webp')
# options.add_argument('Accept-Encoding= gzip')
# options.add_argument('Accept-Language= en-US,en;q=0.9,es;q=0.8')
# options.add_argument('Upgrade-Insecure-Requests: 1')
# options.add_argument('image/apng,*/*;q=0.8,application/signed-exchange;v=b3')
# options.add_argument('user-agent=' + ua['google chrome'])
# options.add_argument('proxy-server=' + "115.42.65.14:8080")
# options.add_argument('Referer=' + "https://www.google.com/")
driver = webdriver.Chrome(executable_path=driver_location,chrome_options=options)

driver.get("https://www.google.com/search?q={}&oq={}&hl=en&num=50".format(urllib.parse.quote(query),urllib.parse.quote(query)))
p = driver.find_elements_by_class_name("tF2Cxc")
titles = driver.find_elements_by_class_name("yuRUbf")
descriptions = driver.find_elements_by_class_name("IsZvec")
time.sleep(10)

link_list = []
description_list = []
featured = False
featured_links = 0
title_list = []
featured_max = 0
featured_num = 0

for index in range(len(p)):
    p_items = p[index].get_attribute("innerHTML")
    print(p_items)
    items_soup = BeautifulSoup(p_items,"html.parser")
    if(featured==False):
        if((len(items_soup.text.split("\n")) != 2)):
            print(items_soup.text.split("\n"))
            if ((items_soup.select(".IsZvec") != None) and 
                   (items_soup.select(".IsZvec")[0].text != "") and (items_soup.select(".IsZvec") != "")):
                a = items_soup.select("a",recursive=False)[0]["href"]
                print(a)
                link_list.append(a)
    title_list.append(titles[index].text)
    description_list.append(descriptions[index].text)
description_list_new = []
title_list_new = []
for index in range(len(description_list)):
    if (description_list[index] == ""):
        pass
    elif (re.findall(r'<\w{1,}\s\w{1,}>',description_list[index]) != []):
        pass
    else:
        description_list_new.append(description_list[index])
        title_list_new.append(title_list[index])
description_list = description_list_new
title_list = title_list_new

for title in range(len(title_list)):
    print(title_list[title])
    print(description_list[title])
    print("=======================")
print(link_list)
print(len(title_list))
print(len(link_list))
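
Note that the find_elements_by_* helpers and the executable_path / chrome_options arguments were removed in Selenium 4; on a current Selenium the setup and lookups would look roughly like this (a sketch of the equivalent calls, not part of the original answer):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 equivalents of the deprecated calls used above
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=Service(r"C:\Program Files (x86)\chromedriver.exe"),
                          options=options)
driver.get("https://www.google.com/search?q=coffee&hl=en&num=50")

p = driver.find_elements(By.CLASS_NAME, "tF2Cxc")
titles = driver.find_elements(By.CLASS_NAME, "yuRUbf")
descriptions = driver.find_elements(By.CLASS_NAME, "IsZvec")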
