Scrape Google Search Result Description Using BeautifulSoup
I want to scrape the Google search result description using BeautifulSoup, but I am not able to scrape the tag that contains the description.
Ancestor:
html
body#gsr.srp.vasq.wf-b
div#main
div#cnt.big
div.mw
div#rcnt
div.col
div#center_col
div#res.med
div#search
div
div#rso
div.g
div.rc
div.IsZvec
div
span.aCOpRe
Children:
em
Python Code:
from bs4 import BeautifulSoup
import requests
import re

search = input("Enter the search term:")
param = {"q": search}
r = requests.get("https://google.com/search", params=param)
soup = BeautifulSoup(r.content, "lxml")

title = soup.findAll("div", class_="BNeawe vvjwJb AP7Wnd")
for t in title:
    print(t.get_text())

description = soup.findAll("span", class_="aCOpRe")
for d in description:
    print(d.get_text())
    print("\n")

for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    print(re.split(r":(?=http)", link["href"].replace("/url?q=", "")))
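The `/url?q=` cleanup in the last two lines can be checked offline. A small sketch, using a made-up href that mimics Google's redirect format:

```python
import re

# Hypothetical redirect-style href, mimicking Google's "/url?q=..." format
href = "/url?q=https://en.wikipedia.org/wiki/Coffee&sa=U&ved=0ab"
cleaned = href.replace("/url?q=", "")

# ":(?=http)" splits only where a colon is immediately followed by "http",
# so a single URL passes through unchanged...
print(re.split(":(?=http)", cleaned))
# ['https://en.wikipedia.org/wiki/Coffee&sa=U&ved=0ab']

# ...while two concatenated URLs are separated:
print(re.split(":(?=http)", "https://a.example:https://b.example"))
# ['https://a.example', 'https://b.example']
```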
The proper CSS selector for snippets (descriptions) of Google Search results is .aCOpRe span:not(.f).
Here's a full example in the online IDE.
from bs4 import BeautifulSoup
import requests
import re

param = {"q": "coffee"}
headers = {
    "User-Agent":
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Safari/605.1.15"
}

r = requests.get("https://google.com/search", params=param, headers=headers)
soup = BeautifulSoup(r.content, "lxml")

title = soup.select(".DKV0Md span")
for t in title:
    print(f"Title: {t.get_text()}\n")

snippets = soup.select(".aCOpRe span:not(.f)")
for d in snippets:
    print(f"Snippet: {d.get_text()}\n")

for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    print(re.split(r":(?=http)", link["href"].replace("/url?q=", "")))
Output
Title: Coffee - Wikipedia
Title: Coffee: Benefits, nutrition, and risks - Medical News Today
...
Snippet: Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. When coffee berries turn from green to bright red in color – indicating ripeness – they are picked, processed, and dried.
Snippet: When people think of coffee, they usually think of its ability to provide an energy boost. ... This article looks at the health benefits of drinking coffee, the evidence ...
...
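As a side note, the regex-based link cleanup above can also be done with the standard library's URL parser, which additionally strips the tracking parameters (`sa`, `ved`) that the `replace` approach leaves attached. A minimal sketch on a made-up redirect href:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical redirect href in Google's "/url?q=..." format
href = "/url?q=https://en.wikipedia.org/wiki/Coffee&sa=U&ved=0ab"
# The real target URL is the value of the "q" query parameter
params = parse_qs(urlparse(href).query)
print(params["q"][0])
# https://en.wikipedia.org/wiki/Coffee
```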
Alternatively, you can extract data from Google Search via SerpApi.
curl example
curl -s 'https://serpapi.com/search?q=coffee&location=Sweden&google_domain=google.se&gl=se&hl=sv&num=100'
Python example
from serpapi import GoogleSearch
import os
params = {
    "engine": "google",
    "q": "coffee",
    "location": "Sweden",
    "google_domain": "google.se",
    "gl": "se",
    "hl": "sv",
    "num": 100,
    "api_key": os.getenv("API_KEY")
}

client = GoogleSearch(params)
data = client.get_dict()

print("Organic results")
for result in data['organic_results']:
    print(f"""
Title: {result['title']}
Link: {result['link']}
Position: {result['position']}
Snippet: {result['snippet']}
""")
Output
Organic results
Title: Coffee - Wikipedia
Link: https://en.wikipedia.org/wiki/Coffee
Position: 1
Snippet: Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. When coffee berries turn from green to bright red ...
Title: Drop Coffee
Link: https://www.dropcoffee.com/
Position: 2
Snippet: Drop Coffee is an award winning roastery in Stockholm, representing Sweden four times in the World Coffee Roasting Championship, placing second, third and ...
...
Disclaimer: I work at SerpApi.
You might want to try the CSS selector and then just pull the text out. For example:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.google.com/search?q=scrap").text
soup = BeautifulSoup(page, "html.parser").select(".s3v9rd.AP7Wnd")
for item in soup:
    print(item.getText(strip=True))
Sample output for scrap:
discard or remove from service (a redundant, old, or inoperative vehicle, vessel, or machine), especially so as to convert it to scrap metal.
Here is my solution: the code gets all the titles, links, breadcrumbs, and descriptions of the Google search results (excluding the featured-snippet and "People also ask" sections) that you can see when you search for something.
import re
import time
import urllib.parse

from bs4 import BeautifulSoup
from selenium import webdriver

query = "Your search term"
driver_location = r"C:\Program Files (x86)\chromedriver.exe"

options = webdriver.ChromeOptions()
options.add_argument('--lang=en,en_US')
# options.add_argument('--disable-gpu')
# options.add_argument('--no-sandbox')
options.add_argument('Accept=text/html,application/xhtml+xml,application/xml;q=0.9,image/webp')
# options.add_argument('Accept-Encoding= gzip')
# options.add_argument('Accept-Language= en-US,en;q=0.9,es;q=0.8')
# options.add_argument('Upgrade-Insecure-Requests: 1')
# options.add_argument('image/apng,*/*;q=0.8,application/signed-exchange;v=b3')
# options.add_argument('user-agent=' + ua['google chrome'])
# options.add_argument('proxy-server=' + "115.42.65.14:8080")
# options.add_argument('Referer=' + "https://www.google.com/")

# Selenium 3 API; in Selenium 4 use a Service object and
# driver.find_elements(By.CLASS_NAME, ...)
driver = webdriver.Chrome(executable_path=driver_location, chrome_options=options)
driver.get("https://www.google.com/search?q={}&oq={}&hl=en&num=50".format(
    urllib.parse.quote(query), urllib.parse.quote(query)))

p = driver.find_elements_by_class_name("tF2Cxc")
titles = driver.find_elements_by_class_name("yuRUbf")
descriptions = driver.find_elements_by_class_name("IsZvec")
time.sleep(10)

link_list = []
title_list = []
description_list = []
featured = False

for index in range(len(p)):
    p_items = p[index].get_attribute("innerHTML")
    print(p_items)
    items_soup = BeautifulSoup(p_items, "html.parser")
    if not featured:
        if len(items_soup.text.split("\n")) != 2:
            print(items_soup.text.split("\n"))
    # Keep the result only if it has a non-empty .IsZvec description
    isz = items_soup.select(".IsZvec")
    if isz and isz[0].text != "":
        a = items_soup.select("a", recursive=False)[0]["href"]
        print(a)
        link_list.append(a)
        title_list.append(titles[index].text)
        description_list.append(descriptions[index].text)

# Drop empty descriptions and ones where raw markup leaked into the text
description_list_new = []
title_list_new = []
for index in range(len(description_list)):
    if description_list[index] == "":
        pass
    elif re.findall(r'<\w{1,}\s\w{1,}>', description_list[index]) != []:
        pass
    else:
        description_list_new.append(description_list[index])
        title_list_new.append(title_list[index])

description_list = description_list_new
title_list = title_list_new

for title in range(len(title_list)):
    print(title_list[title])
    print(description_list[title])
    print("=======================")

print(link_list)
print(len(title_list))
print(len(link_list))
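The filtering loop above drops empty descriptions and ones where leftover markup leaked into the text; the regex it relies on can be tried in isolation (the sample strings below are invented for illustration):

```python
import re

samples = [
    "A normal snippet about coffee.",
    "Leaked markup: <div class> residue",
    "",
]
# Keep only non-empty strings with no "<word word>" tag-like fragments
kept = [s for s in samples
        if s != "" and re.findall(r'<\w{1,}\s\w{1,}>', s) == []]
print(kept)
# ['A normal snippet about coffee.']
```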