Scrape Google Search Result Description Using BeautifulSoup

I want to scrape the Google search result description using BeautifulSoup, but I cannot grab the tag that contains the description.
Ancestors:
html
body#gsr.srp.vasq.wf-b
div#main
div#cnt.big
div.mw
div#rcnt
div.col
div#center_col
div#res.med
div#search
div
div#rso
div.g
div.rc
div.IsZvec
div
span.aCOpRe
Children:
em
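Before pointing a selector at the live page, the ancestor chain above can be exercised against a stub fragment; in the sketch below only the class names come from the hierarchy listed here, and the snippet text itself is invented:

```python
from bs4 import BeautifulSoup

# Stub HTML mirroring the ancestor chain in the question
# (div.g > div.rc > div.IsZvec > div > span.aCOpRe > em);
# the snippet text is made up for illustration.
html = """
<div class="g"><div class="rc"><div class="IsZvec"><div>
  <span class="aCOpRe">Coffee is a <em>brewed</em> drink.</span>
</div></div></div></div>
"""

soup = BeautifulSoup(html, "html.parser")
snippets = [s.get_text() for s in soup.select("div.g div.IsZvec span.aCOpRe")]
print(snippets)
```

If the selector works on the stub but not on a real response, the live page was likely served with different markup (Google varies it by user agent), which is a separate problem from the selector itself.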
Python code:
from bs4 import BeautifulSoup
import requests
import re

search = input("Enter the search term:")
param = {"q": search}
r = requests.get("https://google.com/search", params=param)
soup = BeautifulSoup(r.content, "lxml")

title = soup.find_all("div", class_="BNeawe vvjwJb AP7Wnd")
for t in title:
    print(t.get_text())

description = soup.find_all("span", class_="aCOpRe")
for d in description:
    print(d.get_text())
    print("\n")

for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    print(re.split(r":(?=http)", link["href"].replace("/url?q=", "")))
The correct CSS selector for the Google search result snippet (description) is `.aCOpRe span:not(.f)`.
from bs4 import BeautifulSoup
import requests
import re

param = {"q": "coffee"}
headers = {
    "User-Agent":
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Safari/605.1.15"
}

r = requests.get("https://google.com/search", params=param, headers=headers)
soup = BeautifulSoup(r.content, "lxml")

title = soup.select(".DKV0Md span")
for t in title:
    print(f"Title: {t.get_text()}\n")

snippets = soup.select(".aCOpRe span:not(.f)")
for d in snippets:
    print(f"Snippet: {d.get_text()}\n")

for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    print(re.split(r":(?=http)", link["href"].replace("/url?q=", "")))
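As an aside, the `/url?q=` redirect links that the regex above matches can also be unwrapped with the standard library's `urllib.parse`, which sidesteps regex escaping entirely; a minimal sketch, where the sample href is made up to match the shape Google uses:

```python
from urllib.parse import urlparse, parse_qs

def unwrap_google_link(href):
    """Return the target URL from a Google '/url?q=...' redirect, or None."""
    parsed = urlparse(href)
    if parsed.path == "/url":
        target = parse_qs(parsed.query).get("q")
        if target:
            return target[0]
    return None

# Hypothetical href of the kind found on a results page
print(unwrap_google_link("/url?q=https://en.wikipedia.org/wiki/Coffee&sa=U"))
```

`parse_qs` also strips the trailing tracking parameters (`&sa=`, `&ved=`, ...) that the regex/`replace` approach leaves attached.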
Output:
Title: Coffee - Wikipedia
Title: Coffee: Benefits, nutrition, and risks - Medical News Today
...
Snippet: Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. When coffee berries turn from green to bright red in color – indicating ripeness – they are picked, processed, and dried.
Snippet: When people think of coffee, they usually think of its ability to provide an energy boost. ... This article looks at the health benefits of drinking coffee, the evidence ...
...
Alternatively, you can extract data from Google Search via SerpApi.
curl example:
curl -s 'https://serpapi.com/search?q=coffee&location=Sweden&google_domain=google.se&gl=se&hl=sv&num=100'
Python example:
from serpapi import GoogleSearch
import os

params = {
    "engine": "google",
    "q": "coffee",
    "location": "Sweden",
    "google_domain": "google.se",
    "gl": "se",
    "hl": "sv",
    "num": 100,
    "api_key": os.getenv("API_KEY")
}

client = GoogleSearch(params)
data = client.get_dict()

print("Organic results")
for result in data['organic_results']:
    print(f"""
Title: {result['title']}
Link: {result['link']}
Position: {result['position']}
Snippet: {result['snippet']}
""")
Output:
Organic results
Title: Coffee - Wikipedia
Link: https://en.wikipedia.org/wiki/Coffee
Position: 1
Snippet: Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. When coffee berries turn from green to bright red ...
Title: Drop Coffee
Link: https://www.dropcoffee.com/
Position: 2
Snippet: Drop Coffee is an award winning roastery in Stockholm, representing Sweden four times in the World Coffee Roasting Championship, placing second, third and ...
...
Disclaimer: I work at SerpApi.
You may want to try using a CSS selector and then pulling the text out. For example:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.google.com/search?q=scrap").text
soup = BeautifulSoup(page, "html.parser").select(".s3v9rd.AP7Wnd")
for item in soup:
    print(item.getText(strip=True))
Sample output (for the query "scrap"):

discard or stop using (a redundant, old, or non-working vehicle, vessel, or machine), especially in order to convert it into scrap metal.
Here is my solution. The code gets all the titles, links, breadcrumbs, and descriptions of the Google search results (without the featured section, as someone asked) that you see when searching for something.
import re
import time
import urllib.parse

from bs4 import BeautifulSoup
from selenium import webdriver

query = "Your search term"
driver_location = r"C:\Program Files (x86)\chromedriver.exe"

options = webdriver.ChromeOptions()
options.add_argument('--lang=en,en_US')
# options.add_argument('--disable-gpu')
# options.add_argument('--no-sandbox')
options.add_argument('Accept=text/html,application/xhtml+xml,application/xml;q=0.9,image/webp')
# options.add_argument('Accept-Encoding= gzip')
# options.add_argument('Accept-Language= en-US,en;q=0.9,es;q=0.8')
# options.add_argument('Upgrade-Insecure-Requests: 1')
# options.add_argument('image/apng,*/*;q=0.8,application/signed-exchange;v=b3')
# options.add_argument('user-agent=' + ua['google chrome'])
# options.add_argument('proxy-server=' + "115.42.65.14:8080")
# options.add_argument('Referer=' + "https://www.google.com/")

driver = webdriver.Chrome(executable_path=driver_location, chrome_options=options)
driver.get("https://www.google.com/search?q={}&oq={}&hl=en&num=50".format(
    urllib.parse.quote(query), urllib.parse.quote(query)))

p = driver.find_elements_by_class_name("tF2Cxc")
titles = driver.find_elements_by_class_name("yuRUbf")
descriptions = driver.find_elements_by_class_name("IsZvec")
time.sleep(10)

link_list = []
title_list = []
description_list = []
featured = False

for index in range(len(p)):
    p_items = p[index].get_attribute("innerHTML")
    items_soup = BeautifulSoup(p_items, "html.parser")
    if not featured:
        # A plain organic result splits into more than two text lines
        if len(items_soup.text.split("\n")) != 2:
            # select() returns a list, so check that it is non-empty
            # before indexing into it
            is_zvec = items_soup.select(".IsZvec")
            if is_zvec and is_zvec[0].text != "":
                a = items_soup.select("a")[0]["href"]
                link_list.append(a)
                title_list.append(titles[index].text)
                description_list.append(descriptions[index].text)

# Drop empty descriptions and ones still containing raw tag text
description_list_new = []
title_list_new = []
for index in range(len(description_list)):
    if description_list[index] == "":
        pass
    elif re.findall(r'<\w{1,}\s\w{1,}>', description_list[index]) != []:
        pass
    else:
        description_list_new.append(description_list[index])
        title_list_new.append(title_list[index])
description_list = description_list_new
title_list = title_list_new

for title in range(len(title_list)):
    print(title_list[title])
    print(description_list[title])
    print("=======================")

print(link_list)
print(len(title_list))
print(len(link_list))
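The post-processing step above (dropping empty descriptions and entries that still contain raw tag text) can be sketched as a standalone function that is easy to test without a browser; the sample data below is invented:

```python
import re

def clean_results(titles, descriptions):
    """Keep (title, description) pairs whose description is non-empty
    and free of leftover markup such as '<span class>'."""
    kept = []
    for title, desc in zip(titles, descriptions):
        if desc == "":
            continue
        # Same pattern the scraper uses to spot stray tag remnants
        if re.findall(r'<\w{1,}\s\w{1,}>', desc):
            continue
        kept.append((title, desc))
    return kept

print(clean_results(["A", "B", "C"], ["Good snippet", "", "<div foo> junk"]))
```

Keeping titles and descriptions paired in one structure also avoids the parallel-list bookkeeping (`title_list_new`, `description_list_new`) in the scraper itself.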