How to use BeautifulSoup to parse Google search results in Python
I am trying to parse the first page of Google search results, specifically the title and the short summary provided for each result. Here is what I have so far:
from urllib.request import urlretrieve
import urllib.parse
from urllib.parse import urlencode, urlparse, parse_qs
import webbrowser
from bs4 import BeautifulSoup
import requests
address = 'https://google.com/#q='
# Default Google search address start
file = open( "OCR.txt", "rt" )
# Open text document that contains the question
word = file.read()
file.close()
myList = [item for item in word.split('\n')]
newString = ' '.join(myList)
# The question is on multiple lines so this joins them together with proper spacing
print(newString)
qstr = urllib.parse.quote_plus(newString)
# Encode the string
newWord = address + qstr
# Combine the base and the encoded query
print(newWord)
source = requests.get(newWord)
soup = BeautifulSoup(source.text, 'lxml')
The part I am stuck on now is walking the HTML tree to extract the specific data I want. Everything I have tried so far has either thrown an error saying that the object has no such attribute, or just returned "[]".
I am new to Python and BeautifulSoup, so I am not sure of the syntax needed to get where I want. I have found that these are the individual search results in the page:
https://ibb.co/jfRakR
Any help on what to add to parse the title and summary of each search result would be massively appreciated.
Thank you!
Your URL doesn't work for me. But with
https://google.com/search?q=
I get results.
import urllib.parse
from bs4 import BeautifulSoup
import requests
import webbrowser

text = 'hello world'
text = urllib.parse.quote_plus(text)
url = 'https://google.com/search?q=' + text
response = requests.get(url)

# Uncomment to save the response and inspect it in a browser
#with open('output.html', 'wb') as f:
#    f.write(response.content)
#webbrowser.open('output.html')

soup = BeautifulSoup(response.text, 'lxml')
# Each search result sits in an element with class "g"
for g in soup.find_all(class_='g'):
    print(g.text)
    print('-----')
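The "[]" in the question is what find_all returns when nothing matches, which is easy to reproduce offline. The HTML below is a made-up stand-in, not Google's real markup, and html.parser is used only so the sketch does not depend on lxml:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a results page (not Google's real markup)
sample_html = """
<div class="g"><h3>First title</h3><span>First summary</span></div>
<div class="g"><h3>Second title</h3><span>Second summary</span></div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# A class that exists yields one element per result
results = soup.find_all(class_='g')
titles = [g.find('h3').get_text() for g in results]
print(titles)

# A class that is not in the page yields an empty list, not an error
print(soup.find_all(class_='no-such-class'))
```

If the real page gives you an empty list, it is worth saving the response to a file and checking which class names it actually contains.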
The URL in your code uses the # symbol. Instead, it should have ? and the /search pathname.
So this ---> https://google.com/#q=
Should be this ---> https://www.google.com/search?q=cake
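To see why the original URL returns no results, note that everything after # is a URL fragment, which is not sent to the server as part of the request. A minimal sketch using only the standard library (build_search_url is a made-up helper name for illustration):

```python
from urllib.parse import urlencode, urlparse

def build_search_url(query):
    # Hypothetical helper: put the search terms in the query string of /search
    return 'https://www.google.com/search?' + urlencode({'q': query})

broken = urlparse('https://google.com/#q=cake')
fixed = urlparse(build_search_url('cake'))

# In the broken URL the search terms end up in the fragment, not the query
print(repr(broken.query), repr(broken.fragment))
# In the fixed URL they are in the query string of the /search path
print(repr(fixed.path), repr(fixed.query))
```

Using urlencode also takes care of escaping spaces and special characters in the search terms.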
You need to send a user-agent header to make it work, because the default user-agent of the requests library starts with "python-requests", and sites can identify it and block the script. Check robots.txt for more. That could be the reason why you're getting an empty result. You can find lists of user-agents to fake a user visit.
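As a rough illustration, the sketch below prepares a request without sending it, so you can inspect which user-agent would go out. The Chrome string is only an example value, not a recommendation:

```python
import requests

# What requests sends when no user-agent header is given
print(requests.utils.default_user_agent())

# Example browser user-agent string (an assumption; substitute a current one)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/70.0.3538.102 Safari/537.36'
}

# Build the request without sending it, to inspect the URL and headers
prepared = requests.Request(
    'GET',
    'https://www.google.com/search',
    params={'q': 'hello world'},
    headers=headers,
).prepare()

print(prepared.url)
print(prepared.headers['User-Agent'])
```

Passing the same headers dictionary to requests.get(url, headers=headers) sends it with the real request.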
You can also use the Google Organic Results API from SerpApi (see the end of this answer).
Code:
from bs4 import BeautifulSoup
import requests
import json

# Send a browser-like user-agent instead of the default python-requests one
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=java&oq=java',
                    headers=headers).text
soup = BeautifulSoup(html, 'lxml')

summary = []
# These class names come from Google's markup when this was written and may change
for container in soup.find_all('div', class_='tF2Cxc'):
    heading = container.find('h3', class_='LC20lb DKV0Md').text
    article_summary = container.find('span', class_='aCOpRe').text
    summary.append({
        'Heading': heading,
        'Article Summary': article_summary,
    })

print(json.dumps(summary, indent=2, ensure_ascii=False))
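Because Google's class names change over time, container.find(...) may return None, and calling .text on None raises the "has no attribute" error mentioned in the question. One way to guard against that is sketched below. Here text_or_none is a hypothetical helper, and the sample HTML is invented for the example:

```python
from bs4 import BeautifulSoup

def text_or_none(parent, name, class_):
    # Hypothetical helper: return the element's text, or None if it is missing
    element = parent.find(name, class_=class_)
    return element.get_text(strip=True) if element is not None else None

# Invented markup: the second result has no summary span
sample_html = """
<div class="tF2Cxc"><h3 class="LC20lb">Java | Oracle</h3>
  <span class="aCOpRe">Java Download.</span></div>
<div class="tF2Cxc"><h3 class="LC20lb">Java - Wikipedia</h3></div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
summary = []
for container in soup.find_all('div', class_='tF2Cxc'):
    summary.append({
        'Heading': text_or_none(container, 'h3', 'LC20lb'),
        'Article Summary': text_or_none(container, 'span', 'aCOpRe'),
    })

print(summary)
```

A missing element then shows up as None in the output instead of stopping the script halfway through the results.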
Output JSON:
[
{
"Heading": "Java | Oracle",
"Article Summary": "Java+You, Download Today! Java Download. » What is Java? » Need Help? » Uninstall. About Java. Go Java Java Training Java + Greenfoot Oracle Code One Oracle Academy for ..."
},
{
"Heading": "Oracle Java Technologies | Oracle",
"Article Summary": "Java Is the Language of Possibilities. Java is powering the innovation behind our digital world. Harness this potential with Java resources for student coders, ..."
},
{
"Heading": "Java Software | Oracle",
"Article Summary": "includes GraalVM Enterprise at no additional cost. Download Java now · Get support. Products. Oracle Java SE Subscription · Oracle JDK · Oracle OpenJDK · Oracle Java SE Platform ..."
},
{
"Heading": "Java (programming language) - Wikipedia",
"Article Summary": "Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible. It is a general-purpose ..."
},
{
"Heading": "Java - Wikipedia",
"Article Summary": "Java (Indonesian: Jawa, Indonesian pronunciation: [ˈdʒawa]; Javanese: ꦗꦮ; Sundanese: ᮏᮝ) is one of the islands of the Greater Sunda Islands in Indonesia, ..."
},
{
"Heading": "Google LLC v. Oracle America, Inc. - Supreme Court",
"Article Summary": "2 days ago — the Java programming language to work with its new Android plat- form, Google copied roughly 11,500 lines of code from the Java SE pro-."
},
{
"Heading": "OpenJDK - Java.net",
"Article Summary": "ZGC. Tools. Mercurial · Git · jtreg harness. Related. java.sun.com · Java Community Process · JDK GA/EA Builds · Oracle logo. © 2021 Oracle Corporation and/or its affiliates. Terms of ..."
}
]
Using the Google Organic Results API from SerpApi:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "java",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(f"Title: {result['title']}\nSummary: {result['snippet']}\n")
Output:
Title: Java | Oracle
Summary: Java Download. » What is Java? » Need Help? » Uninstall. About Java. Go Java Java Training Java + Greenfoot Oracle Code One Oracle Academy for ...
Title: Oracle Java Technologies | Oracle
Summary: Java Is the Language of Possibilities. Java is powering the innovation behind our digital world. Harness this potential with Java resources for student coders, ...
Title: Java SE - Downloads | Oracle Technology Network | Oracle
Summary: Java SE downloads including: Java Development Kit (JDK), Server Java Runtime Environment (Server JRE), and Java Runtime Environment (JRE).
Title: Java (programming language) - Wikipedia
Summary: Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible. It is a general-purpose ...
Title: Java - Wikipedia
Summary: Java (Indonesian: Jawa, Indonesian pronunciation: [ˈdʒawa]; Javanese: ꦗꦮ; Sundanese: ᮏᮝ) is one of the islands of the Greater Sunda Islands in Indonesia, ...
Title: OpenJDK - Java.net
Summary: What is this? The place to collaborate on an open-source implementation of the Java Platform, Standard Edition, and related projects. (Learn more.).
Title: Java Resources for Students, Hobbyists and More | go.Java ...
Summary: Java Powers Our Digital World. Java is at the heart of our digital lifestyle. It's the platform for launching careers, exploring human-to-digital interfaces, architecting ...
Make sure the API_KEY environment variable is set to your SerpApi API key.
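If the variable is not set, os.getenv returns None and the failure only surfaces later, when the search is made. A small sketch of an early check follows; load_api_key is a hypothetical helper, not part of the serpapi package:

```python
import os

def load_api_key(env=None, name='API_KEY'):
    # Hypothetical helper: fail early with a clear message if the key is unset
    env = os.environ if env is None else env
    key = env.get(name)
    if not key:
        raise RuntimeError(f'Set the {name} environment variable first')
    return key

# Example with an explicit mapping, so no real key is read or printed
print(load_api_key({'API_KEY': 'example-key'}))
```

In the script above you would call load_api_key() with no arguments and pass the result as the "api_key" parameter.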
Disclaimer: I work for SerpApi.