
How to use BeautifulSoup to parse google search results in Python

I am trying to parse the first page of Google search results, specifically the title and the short summary shown for each result. Here is what I have so far:

from urllib.request import urlretrieve
import urllib.parse
from urllib.parse import urlencode, urlparse, parse_qs
import webbrowser
from bs4 import BeautifulSoup
import requests

address = 'https://google.com/#q='
# Default Google search address start
file = open( "OCR.txt", "rt" )
# Open text document that contains the question
word = file.read()
file.close()

myList = [item for item in word.split('\n')]
newString = ' '.join(myList)
# The question is on multiple lines so this joins them together with proper spacing

print(newString)

qstr = urllib.parse.quote_plus(newString)
# Encode the string

newWord = address + qstr
# Combine the base and the encoded query

print(newWord)

source = requests.get(newWord)

soup = BeautifulSoup(source.text, 'lxml')

The part I am stuck on is navigating the HTML to extract the specific data I want. Everything I have tried so far has either raised an error saying the object has no such attribute or returned an empty list ("[]").

I am new to Python and BeautifulSoup, so I am not sure of the syntax for getting to where I want. I have found that these are the individual search results in the page:

https://ibb.co/jfRakR

Any help on what to add to parse the Title and Summary of each search result would be MASSIVELY appreciated.

Thank you!

Your URL doesn't work for me, but with https://google.com/search?q= I get results.

import urllib.parse
from bs4 import BeautifulSoup
import requests
import webbrowser

text = 'hello world'
text = urllib.parse.quote_plus(text)

url = 'https://google.com/search?q=' + text

response = requests.get(url)

#with open('output.html', 'wb') as f:
#    f.write(response.content)
#webbrowser.open('output.html')

soup = BeautifulSoup(response.text, 'lxml')
for g in soup.find_all(class_='g'):
    print(g.text)
    print('-----')

Read the Beautiful Soup documentation.

  1. The default Google search address in the question is incorrect. It should not contain the # symbol; instead it needs the /search path and a ? before the query.
So this ---> https://google.com/#q=
Should be this ---> https://www.google.com/search?q=cake
  2. You need to send a user-agent header, because the default requests user-agent identifies itself as "python-requests", so sites can recognize it and block the script. Check robots.txt for more. That could be the reason why you're getting an empty result. You can pick a string from a list of real browser user-agents to fake a user visit.

  3. You can use the Google Organic Results API from SerpApi (see at the end).
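To illustrate the first point, here is a minimal sketch using only the standard library. It shows that everything after # is a URL fragment, which a client does not send to the server, so the request from the question carries no query at all. The helper name build_search_url is hypothetical and not part of the original code.

```python
from urllib.parse import urlencode, urlsplit

# The address from the question: everything after '#' is a fragment,
# so the query component of the URL is empty.
broken = urlsplit('https://google.com/#q=cake')
print(repr(broken.query))
print(repr(broken.fragment))


def build_search_url(query):
    """Return a search URL with the query encoded after /search?q=."""
    # urlencode escapes spaces and special characters in the query text.
    return 'https://www.google.com/search?' + urlencode({'q': query})


print(build_search_url('hello world'))
```

The resulting URL can then be passed to requests.get together with the headers dictionary described in the second point.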

Code:

from bs4 import BeautifulSoup
import requests
import json

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=java&oq=java',
                    headers=headers).text

soup = BeautifulSoup(html, 'lxml')

summary = []

for container in soup.find_all('div', class_='tF2Cxc'):
  heading = container.find('h3', class_='LC20lb DKV0Md').text
  article_summary = container.find('span', class_='aCOpRe').text

  summary.append({
      'Heading': heading,
      'Article Summary': article_summary,
  })

print(json.dumps(summary, indent=2, ensure_ascii=False))

Output JSON:

[
  {
    "Heading": "Java | Oracle",
    "Article Summary": "Java+You, Download Today! Java Download. » What is Java? » Need Help? » Uninstall. About Java. Go Java Java Training Java + Greenfoot Oracle Code One Oracle Academy for ..."
  },
  {
    "Heading": "Oracle Java Technologies | Oracle",
    "Article Summary": "Java Is the Language of Possibilities. Java is powering the innovation behind our digital world. Harness this potential with Java resources for student coders, ..."
  },
  {
    "Heading": "Java Software | Oracle",
    "Article Summary": "includes GraalVM Enterprise at no additional cost. Download Java now · Get support. Products. Oracle Java SE Subscription · Oracle JDK · Oracle OpenJDK · Oracle Java SE Platform ..."
  },
  {
    "Heading": "Java (programming language) - Wikipedia",
    "Article Summary": "Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible. It is a general-purpose ..."
  },
  {
    "Heading": "Java - Wikipedia",
    "Article Summary": "Java (Indonesian: Jawa, Indonesian pronunciation: [ˈdʒawa]; Javanese: ꦗꦮ; Sundanese: ᮏᮝ) is one of the islands of the Greater Sunda Islands in Indonesia, ..."
  },
  {
    "Heading": "Google LLC v. Oracle America, Inc. - Supreme Court",
    "Article Summary": "2 days ago — the Java programming language to work with its new Android plat- form, Google copied roughly 11,500 lines of code from the Java SE pro-."
  },
  {
    "Heading": "OpenJDK - Java.net",
    "Article Summary": "ZGC. Tools. Mercurial · Git · jtreg harness. Related. java.sun.com · Java Community Process · JDK GA/EA Builds · Oracle logo. © 2021 Oracle Corporation and/or its affiliates. Terms of ..."
  }
]
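One caveat with the loop above: find() returns None when an element is missing, and calling .text on None raises the "has no attribute" error mentioned in the question. Below is a defensive sketch of the same extraction. It runs against a small hand-written HTML snippet (hypothetical, not real Google output) that reuses the class names from the code above; those class names worked when this answer was written and Google may change them at any time. The function name extract_results is also an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written HTML that imitates the class names used above.
SAMPLE_HTML = """
<div class="tF2Cxc">
  <h3 class="LC20lb DKV0Md">Java | Oracle</h3>
  <span class="aCOpRe">Java Download. What is Java?</span>
</div>
<div class="tF2Cxc">
  <h3 class="LC20lb DKV0Md">Result without a summary</h3>
</div>
"""


def extract_results(html):
    """Collect heading/summary pairs, skipping containers without a heading."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for container in soup.find_all('div', class_='tF2Cxc'):
        heading = container.find('h3')
        snippet = container.find('span', class_='aCOpRe')
        if heading is None:
            continue  # nothing useful to report for this container
        results.append({
            'Heading': heading.get_text(strip=True),
            # Fall back to an empty string when the summary is missing.
            'Article Summary': snippet.get_text(strip=True) if snippet else '',
        })
    return results


results = extract_results(SAMPLE_HTML)
for item in results:
    print(item)
```

With live results, the same function could be called on the html string fetched with requests in the code above.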

Using SerpApi:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "java",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(f"Title: {result['title']}\nSummary: {result['snippet']}\n")

Output:

Title: Java | Oracle
Summary: Java Download. » What is Java? » Need Help? » Uninstall. About Java. Go Java Java Training Java + Greenfoot Oracle Code One Oracle Academy for ...

Title: Oracle Java Technologies | Oracle
Summary: Java Is the Language of Possibilities. Java is powering the innovation behind our digital world. Harness this potential with Java resources for student coders, ...

Title: Java SE - Downloads | Oracle Technology Network | Oracle
Summary: Java SE downloads including: Java Development Kit (JDK), Server Java Runtime Environment (Server JRE), and Java Runtime Environment (JRE).

Title: Java (programming language) - Wikipedia
Summary: Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible. It is a general-purpose ...

Title: Java - Wikipedia
Summary: Java (Indonesian: Jawa, Indonesian pronunciation: [ˈdʒawa]; Javanese: ꦗꦮ; Sundanese: ᮏᮝ) is one of the islands of the Greater Sunda Islands in Indonesia, ...

Title: OpenJDK - Java.net
Summary: What is this? The place to collaborate on an open-source implementation of the Java Platform, Standard Edition, and related projects. (Learn more.).

Title: Java Resources for Students, Hobbyists and More | go.Java ...
Summary: Java Powers Our Digital World. Java is at the heart of our digital lifestyle. It's the platform for launching careers, exploring human-to-digital interfaces, architecting ...

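Make sure you have set an environment variable named API_KEY (for example through an environment variable file) that holds your SerpApi API key, since the code reads it with os.getenv("API_KEY").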
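Note that os.getenv returns None when the variable is unset, so a missing key would only show up later as a failed request. A small sketch of an explicit check follows; the helper name load_api_key and the error message are hypothetical, not part of SerpApi.

```python
import os


def load_api_key(name="API_KEY", environ=os.environ):
    """Return the key stored under `name`, or raise a clear error if unset."""
    key = environ.get(name)
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set")
    return key
```

The params dictionary above could then use "api_key": load_api_key() in place of os.getenv("API_KEY").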

Disclaimer: I work for SerpApi.
