
Scrape Google Search Result Description Using BeautifulSoup

I want to scrape the Google Search result descriptions using BeautifulSoup, but I am not able to scrape the tag that contains the description.

Ancestor:

html
body#gsr.srp.vasq.wf-b
div#main
div#cnt.big
div.mw
div#rcnt
div.col
div#center_col
div#res.med
div#search
div
div#rso
div.g
div.rc
div.IsZvec
div
span.aCOpRe

Children:

em
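The ancestor chain above maps directly onto a CSS selector. As a minimal sketch against an invented static fragment that mimics that structure (the fragment is not real Google markup, only the class/id names from the path above):

```python
from bs4 import BeautifulSoup

# Invented fragment mimicking the ancestor path listed above
html = """
<div id="rso">
  <div class="g"><div class="rc"><div class="IsZvec"><div>
    <span class="aCOpRe">A result <em>description</em> snippet</span>
  </div></div></div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Descend the same path the question lists: #rso ... .IsZvec ... span.aCOpRe
snippets = [s.get_text() for s in soup.select("#rso .IsZvec span.aCOpRe")]
print(snippets)
```

If the selector matches nothing on the live page, the served markup differs from this path (which is what happens below when no User-Agent header is sent).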

Python Code:

from bs4 import BeautifulSoup
import requests
import re

search = input("Enter the search term:")
param = {"q": search}

r = requests.get("https://google.com/search", params=param)

soup = BeautifulSoup(r.content, "lxml")

title = soup.findAll("div", class_="BNeawe vvjwJb AP7Wnd")

for t in title:
    print(t.get_text())

description = soup.findAll("span", class_="aCOpRe")

for d in description:
    print(d.get_text())

print("\n")

for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    print(re.split(":(?=http)", link["href"].replace("/url?q=", "")))

Image link displaying the tag

The proper CSS selector for snippets (descriptions) of Google Search results is .aCOpRe span:not(.f).

Here's a full example in an online IDE.

from bs4 import BeautifulSoup
import requests
import re

param = {"q": "coffee"}
headers = {
    "User-Agent":
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Safari/605.1.15"
}

r = requests.get("https://google.com/search", params=param, headers=headers)

soup = BeautifulSoup(r.content, "lxml")

title = soup.select(".DKV0Md span")

for t in title:
    print(f"Title: {t.get_text()}\n")

snippets = soup.select(".aCOpRe span:not(.f)")

for d in snippets:
    print(f"Snippet: {d.get_text()}\n")

for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    print(re.split(":(?=http)", link["href"].replace("/url?q=", "")))

Output

Title: Coffee - Wikipedia

Title: Coffee: Benefits, nutrition, and risks - Medical News Today

...

Snippet: Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. When coffee berries turn from green to bright red in color – indicating ripeness – they are picked, processed, and dried.

Snippet: When people think of coffee, they usually think of its ability to provide an energy boost. ... This article looks at the health benefits of drinking coffee, the evidence ...

...
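The regex cleanup in the last loop can be seen in isolation: it strips Google's /url?q= redirect prefix from each href and splits wherever a colon is immediately followed by another URL. A minimal sketch with hypothetical hrefs in that format (the sample values are invented):

```python
import re

# Hypothetical hrefs in the /url?q= redirect format Google uses
hrefs = [
    "/url?q=https://en.wikipedia.org/wiki/Coffee&sa=U",
    "/url?q=https://example.com/a:https://example.com/cached",  # two URLs fused
]

cleaned = []
for h in hrefs:
    # Same cleanup as the loop above: drop the prefix, then split
    # wherever a ':' is immediately followed by 'http'
    cleaned.append(re.split(":(?=http)", h.replace("/url?q=", "")))

print(cleaned)
```

The lookahead `(?=http)` keeps the `http` of the second URL while consuming only the separating colon, so each fused href comes back as a list of whole URLs.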

Alternatively, you can extract data from Google Search via SerpApi.

curl example

curl -s 'https://serpapi.com/search?q=coffee&location=Sweden&google_domain=google.se&gl=se&hl=sv&num=100'

Python example

from serpapi import GoogleSearch
import os

params = {
    "engine": "google",
    "q": "coffee",
    "location": "Sweden",
    "google_domain": "google.se",
    "gl": "se",
    "hl": "sv",
    "num": 100,
    "api_key": os.getenv("API_KEY")
}

client = GoogleSearch(params)
data = client.get_dict()

print("Organic results")

for result in data['organic_results']:
    print(f"""
Title: {result['title']}
Link: {result['link']}
Position: {result['position']}
Snippet: {result['snippet']}
""")

Output

Organic results

Title: Coffee - Wikipedia
Link: https://en.wikipedia.org/wiki/Coffee
Position: 1
Snippet: Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. When coffee berries turn from green to bright red ...


Title: Drop Coffee
Link: https://www.dropcoffee.com/
Position: 2
Snippet: Drop Coffee is an award winning roastery in Stockholm, representing Sweden four times in the World Coffee Roasting Championship, placing second, third and ...

...

Disclaimer: I work at SerpApi.

You might want to try the CSS selector and then just pull the text out.

For example:

import requests
from bs4 import BeautifulSoup


page = requests.get("https://www.google.com/search?q=scrap").text
soup = BeautifulSoup(page, "html.parser").select(".s3v9rd.AP7Wnd")

for item in soup:
    print(item.getText(strip=True))

Sample output for scrap:

discard or remove from service (a redundant, old, or inoperative vehicle, vessel, or machine), especially so as to convert it to scrap metal.
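The compound selector .s3v9rd.AP7Wnd matches elements that carry both classes at once, not a descendant relationship. A minimal sketch against an invented fragment (not real Google markup) showing the difference:

```python
from bs4 import BeautifulSoup

# Invented fragment: only the first span carries both classes
html = """
<span class="s3v9rd AP7Wnd">discard or remove from service ...</span>
<span class="s3v9rd">other text</span>
"""

soup = BeautifulSoup(html, "html.parser")
# .s3v9rd.AP7Wnd (no space) requires both classes on the same element
texts = [el.get_text(strip=True) for el in soup.select(".s3v9rd.AP7Wnd")]
print(texts)
```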

Here is my solution: the code gets all the titles, links, breadcrumbs, and descriptions of the Google search results (excluding the featured snippet and "People also ask" sections) that you can see when you search for something.

import re
import time
import urllib.parse

from bs4 import BeautifulSoup
from selenium import webdriver

query = "Your search term"
driver_location = r"C:\Program Files (x86)\chromedriver.exe"
options = webdriver.ChromeOptions()
options.add_argument('--lang=en,en_US')
# options.add_argument('--disable-gpu')
# options.add_argument('--no-sandbox')
options.add_argument('Accept=text/html,application/xhtml+xml,application/xml;q=0.9,image/webp')
# options.add_argument('Accept-Encoding= gzip')
# options.add_argument('Accept-Language= en-US,en;q=0.9,es;q=0.8')
# options.add_argument('Upgrade-Insecure-Requests: 1')
# options.add_argument('image/apng,*/*;q=0.8,application/signed-exchange;v=b3')
# options.add_argument('user-agent=' + ua['google chrome'])
# options.add_argument('proxy-server=' + "115.42.65.14:8080")
# options.add_argument('Referer=' + "https://www.google.com/")
driver = webdriver.Chrome(executable_path=driver_location, chrome_options=options)

driver.get("https://www.google.com/search?q={}&oq={}&hl=en&num=50".format(
    urllib.parse.quote(query), urllib.parse.quote(query)))
p = driver.find_elements_by_class_name("tF2Cxc")
titles = driver.find_elements_by_class_name("yuRUbf")
descriptions = driver.find_elements_by_class_name("IsZvec")
time.sleep(10)

link_list = []
description_list = []
title_list = []

for index in range(len(p)):
    p_items = p[index].get_attribute("innerHTML")
    items_soup = BeautifulSoup(p_items, "html.parser")
    # Skip featured-style blocks, which collapse to two text lines
    if len(items_soup.text.split("\n")) != 2:
        if items_soup.select(".IsZvec") and items_soup.select(".IsZvec")[0].text != "":
            a = items_soup.select("a")[0]["href"]
            link_list.append(a)
    title_list.append(titles[index].text)
    description_list.append(descriptions[index].text)

# Drop empty descriptions and descriptions that still contain tag-like text
description_list_new = []
title_list_new = []
for index in range(len(description_list)):
    if description_list[index] == "":
        pass
    elif re.findall(r'<\w{1,}\s\w{1,}>', description_list[index]) != []:
        pass
    else:
        description_list_new.append(description_list[index])
        title_list_new.append(title_list[index])
description_list = description_list_new
title_list = title_list_new

for title in range(len(title_list)):
    print(title_list[title])
    print(description_list[title])
    print("=======================")
print(link_list)
print(len(title_list))
print(len(link_list))
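The filtering step in this answer drops descriptions that are empty or that still contain tag-like leftovers via re.findall(r'&lt;\w{1,}\s\w{1,}&gt;', ...). It can be checked in isolation with invented sample data:

```python
import re

# Invented samples: one real snippet, one empty, one with leftover markup
titles = ["Real", "Empty", "Broken"]
descriptions = ["A real snippet.", "", "broken <div class>..."]

kept_titles, kept_descriptions = [], []
for title, desc in zip(titles, descriptions):
    # Same filter as above: skip empty strings and strings that still
    # contain tag-like text such as '<div class>'
    if desc == "" or re.findall(r'<\w{1,}\s\w{1,}>', desc):
        continue
    kept_titles.append(title)
    kept_descriptions.append(desc)

print(kept_titles, kept_descriptions)
```

Titles are appended in lockstep with descriptions, so the two lists stay aligned after filtering.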
