How to use BeautifulSoup to parse google search results in Python

I am trying to parse the first page of Google search results, specifically the title and the short summary provided for each result. Here is what I have so far:

from urllib.request import urlretrieve
import urllib.parse
from urllib.parse import urlencode, urlparse, parse_qs
import webbrowser
from bs4 import BeautifulSoup
import requests

address = 'https://google.com/#q='
# Default Google search address start
file = open( "OCR.txt", "rt" )
# Open text document that contains the question
word = file.read()
file.close()

myList = [item for item in word.split('\n')]
newString = ' '.join(myList)
# The question is on multiple lines so this joins them together with proper spacing

print(newString)

qstr = urllib.parse.quote_plus(newString)
# Encode the string

newWord = address + qstr
# Combine the base and the encoded query

print(newWord)

source = requests.get(newWord)

soup = BeautifulSoup(source.text, 'lxml')

The part I am stuck on now is walking the HTML to extract the specific data that I want. Everything I have tried so far has either thrown an error saying that the object has no attribute, or has just returned "[]".

I am new to Python and BeautifulSoup, so I am not sure of the syntax needed to get where I want. I have found that these are the individual search results in the page:

https://ibb.co/jfRakR

Any help on what to add to parse the Title and Summary of each search result would be MASSIVELY appreciated.

Thank you!

Your URL doesn't work for me, but with https://google.com/search?q= I get results.

import urllib.parse
from bs4 import BeautifulSoup
import requests
import webbrowser

text = 'hello world'
text = urllib.parse.quote_plus(text)

url = 'https://google.com/search?q=' + text

response = requests.get(url)

#with open('output.html', 'wb') as f:
#    f.write(response.content)
#webbrowser.open('output.html')

soup = BeautifulSoup(response.text, 'lxml')
for g in soup.find_all(class_='g'):
    print(g.text)
    print('-----')

Read the Beautiful Soup documentation.
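The loop above prints the whole text of each result block. As a minimal sketch of separating the title from the summary, the example below runs against a small hand-written HTML sample rather than a live page. The markup, the `st` class name, and the `extract_results` helper are assumptions made for illustration; Google's real markup differs and changes often, so the selectors would need to be checked against the page you actually receive.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for a results page.
# Only the 'g' class comes from the answer above; 'st' is an assumption.
SAMPLE_HTML = """
<div class="g">
  <h3>Java | Oracle</h3>
  <span class="st">Java Download. What is Java?</span>
</div>
<div class="g">
  <h3>Java - Wikipedia</h3>
</div>
"""


def extract_results(html):
    """Return one {'title', 'summary'} dict per result block."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for block in soup.find_all(class_="g"):
        title_tag = block.find("h3")
        if title_tag is None:
            # Not a regular result block, so there is nothing to report.
            continue
        summary_tag = block.find("span", class_="st")
        results.append({
            "title": title_tag.get_text(strip=True),
            # find() returns None when nothing matches; calling .text on
            # None is what raises the "has no attribute" error.
            "summary": summary_tag.get_text(strip=True) if summary_tag else None,
        })
    return results


for item in extract_results(SAMPLE_HTML):
    print(item)
```

Checking the result of `find()` before reading its text keeps one unusual result block from aborting the whole loop.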

  1. The default Google search address in the question is slightly incorrect. It should not contain the # symbol. Instead, it should have ? and the /search pathname.
So this ---> https://google.com/#q=
Should be this ---> https://www.google.com/search?q=cake

  2. You need to send a user-agent header to make it work, because the default python user-agent is "python-requests" and sites can identify it and block the script. Check Robots.txt for more. That could be the reason why you're getting an empty result. A list of user-agents can be used to fake a user visit.

  3. You can use the Google Organic Results API from SerpApi (see at the end).
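To illustrate the first point, here is a small sketch using only the standard library. It shows that everything after `#` is parsed as a URL fragment, which HTTP clients do not send to the server, while the `/search?q=` form carries the query in the query string. The `build_search_url` helper is a hypothetical name introduced for this example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(query):
    """Build a /search URL with the query placed in the query string."""
    # urlencode escapes spaces and special characters (quote_plus rules).
    return "https://www.google.com/search?" + urlencode({"q": query})


broken = urlsplit("https://google.com/#q=hello+world")
fixed = urlsplit(build_search_url("hello world"))

# With '#', the search terms sit in the fragment and the query is empty.
print(repr(broken.query), repr(broken.fragment))

# With '/search?q=', the terms are part of the request itself.
print(fixed.path, parse_qs(fixed.query))
```

With `requests`, the same result can be had by passing `params={'q': text}` to `requests.get`, which builds the query string for you.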

Code:

from bs4 import BeautifulSoup
import requests
import json

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=java&oq=java',
                    headers=headers).text

soup = BeautifulSoup(html, 'lxml')

summary = []

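# These class names reflect Google's markup when this was written and may change.
for container in soup.find_all('div', class_='tF2Cxc'):
    heading_tag = container.find('h3', class_='LC20lb DKV0Md')
    summary_tag = container.find('span', class_='aCOpRe')
    # Skip blocks missing either element instead of failing on None.text
    if heading_tag is None or summary_tag is None:
        continue

    summary.append({
        'Heading': heading_tag.text,
        'Article Summary': summary_tag.text,
    })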

print(json.dumps(summary, indent=2, ensure_ascii=False))

Output JSON:

[
  {
    "Heading": "Java | Oracle",
    "Article Summary": "Java+You, Download Today! Java Download. » What is Java? » Need Help? » Uninstall. About Java. Go Java Java Training Java + Greenfoot Oracle Code One Oracle Academy for ..."
  },
  {
    "Heading": "Oracle Java Technologies | Oracle",
    "Article Summary": "Java Is the Language of Possibilities. Java is powering the innovation behind our digital world. Harness this potential with Java resources for student coders, ..."
  },
  {
    "Heading": "Java Software | Oracle",
    "Article Summary": "includes GraalVM Enterprise at no additional cost. Download Java now · Get support. Products. Oracle Java SE Subscription · Oracle JDK · Oracle OpenJDK · Oracle Java SE Platform ..."
  },
  {
    "Heading": "Java (programming language) - Wikipedia",
    "Article Summary": "Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible. It is a general-purpose ..."
  },
  {
    "Heading": "Java - Wikipedia",
    "Article Summary": "Java (Indonesian: Jawa, Indonesian pronunciation: [ˈdʒawa]; Javanese: ꦗꦮ; Sundanese: ᮏᮝ) is one of the islands of the Greater Sunda Islands in Indonesia, ..."
  },
  {
    "Heading": "Google LLC v. Oracle America, Inc. - Supreme Court",
    "Article Summary": "2 days ago — the Java programming language to work with its new Android plat- form, Google copied roughly 11,500 lines of code from the Java SE pro-."
  },
  {
    "Heading": "OpenJDK - Java.net",
    "Article Summary": "ZGC. Tools. Mercurial · Git · jtreg harness. Related. java.sun.com · Java Community Process · JDK GA/EA Builds · Oracle logo. © 2021 Oracle Corporation and/or its affiliates. Terms of ..."
  }
]

Using SerpApi:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "java",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
   print(f"Title: {result['title']}\nSummary: {result['snippet']}\n")

Output:

Title: Java | Oracle
Summary: Java Download. » What is Java? » Need Help? » Uninstall. About Java. Go Java Java Training Java + Greenfoot Oracle Code One Oracle Academy for ...

Title: Oracle Java Technologies | Oracle
Summary: Java Is the Language of Possibilities. Java is powering the innovation behind our digital world. Harness this potential with Java resources for student coders, ...

Title: Java SE - Downloads | Oracle Technology Network | Oracle
Summary: Java SE downloads including: Java Development Kit (JDK), Server Java Runtime Environment (Server JRE), and Java Runtime Environment (JRE).

Title: Java (programming language) - Wikipedia
Summary: Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible. It is a general-purpose ...

Title: Java - Wikipedia
Summary: Java (Indonesian: Jawa, Indonesian pronunciation: [ˈdʒawa]; Javanese: ꦗꦮ; Sundanese: ᮏᮝ) is one of the islands of the Greater Sunda Islands in Indonesia, ...

Title: OpenJDK - Java.net
Summary: What is this? The place to collaborate on an open-source implementation of the Java Platform, Standard Edition, and related projects. (Learn more.).

Title: Java Resources for Students, Hobbyists and More | go.Java ...
Summary: Java Powers Our Digital World. Java is at the heart of our digital lifestyle. It's the platform for launching careers, exploring human-to-digital interfaces, architecting ...

Make sure you have created an environment variable file containing your api_key.

Disclaimer: I work for SerpApi.
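On the environment variable mentioned above: `os.getenv("API_KEY")` returns `None` when the variable is missing, and the failure then only shows up later as an API error. One possible way to fail earlier is sketched below. The `load_api_key` helper and its `env` parameter are hypothetical names for this example, not part of the SerpApi client.

```python
import os


def load_api_key(name="API_KEY", env=None):
    """Return the API key from the environment, or raise if it is unset."""
    # 'env' lets callers pass a plain dict instead of os.environ,
    # which is convenient for trying the function out.
    env = os.environ if env is None else env
    key = env.get(name)
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set")
    return key


# Demonstration with a stand-in mapping rather than the real environment.
print(load_api_key(env={"API_KEY": "demo-key"}))
```

In the SerpApi example, `"api_key": load_api_key()` could then replace `"api_key": os.getenv("API_KEY")`.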
