
Scrape the snippet text from google search page

When we search for a question on Google, it often produces an answer in a snippet like the following:

[Screenshot: search result for when+barack+obama+born]

My objective is to scrape this text ("August 4, 1961", circled in red in the screenshot) in my Python code.

Before trying to scrape the text, I stored the web response in a text file using the following code:

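import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.google.com/search?q=when+barak+obama+born")
soup = BeautifulSoup(page.content, 'html.parser')
# Save the parsed response so it can be inspected offline
with open("web_response.txt", "w", encoding='utf-8') as out_file:
    out_file.write(soup.prettify())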

Using the browser's Inspect Element tool, I noticed that the snippet is inside a div with the classes Z0LcW XcVN5d (circled in green in the screenshot). However, the response in my text file contains no such text, let alone the class name.

I've also tried this solution, where the author scraped items with the id rhs_block, but my response contains no such id.

I searched for occurrences of "August 4, 1961" in my response text file to work out whether any of them could be the snippet, but none of the occurrences seemed to be the one I was looking for.
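One way to check which class names actually made it into a saved response is to list them all and test for the one seen in the browser. The following is only a minimal sketch using the standard library's html.parser; the names ClassCollector and collect_classes are hypothetical, and the inline sample stands in for the real file contents:

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name seen in the parsed markup."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def collect_classes(html):
    """Return the set of class names used anywhere in the given HTML string."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


# Inline sample (an assumption, not real Google markup)
sample = '<div class="Z0LcW XcVN5d">August 4, 1961</div><span class="other">x</span>'
found = collect_classes(sample)
print("Z0LcW" in found)
```

To run it against the saved response, the text of web_response.txt can be read and passed to collect_classes in place of sample.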

My plan was to get the div id or class name of the snippet and find its content like this:

# PSEUDOCODE: 'something' stands for the class name I hope to find
# (use id='something' instead to match by id)
containers = soup.find_all(class_='something')
for tag in containers:
    print(f"tag text : {tag.text}")

Is there any way to do this?

NOTE: I'm also okay with using libraries other than BeautifulSoup and requests, as long as they produce the result.

There's no need to use Selenium. You can achieve this using requests and BS4, since everything you need is in the HTML and no dynamic JavaScript is involved.

Code and example in an online IDE:

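from bs4 import BeautifulSoup
import requests, lxml  # lxml is imported only to confirm the parser is installed

# A browser-like User-Agent; Google may return different markup to the default requests one
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# Passing the query via params lets requests URL-encode the spaces
html = requests.get('https://www.google.com/search', params={'q': 'Barack Obama born date'}, headers=headers).text

soup = BeautifulSoup(html, 'lxml')

# These class names come from the page markup and may change over time
born = soup.select_one('.XcVN5d').text
age = soup.select_one('.kZ91ed').text

print(born)
print(age)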

Output:

August 4, 1961
age 59 years
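If either class is missing from the response, select_one returns None and accessing .text raises an AttributeError. A minimal sketch of a more defensive lookup follows; the helper name first_text is hypothetical, and the inline HTML is an assumed stand-in for a live response rather than Google's actual markup:

```python
from bs4 import BeautifulSoup


def first_text(soup, selectors):
    """Return the stripped text of the first selector that matches, else None."""
    for selector in selectors:
        node = soup.select_one(selector)
        if node is not None:
            return node.get_text(strip=True)
    return None


# Inline sample standing in for a real response (assumption)
sample_html = '<div><div class="Z0LcW XcVN5d"> August 4, 1961 </div></div>'
soup = BeautifulSoup(sample_html, "html.parser")

print(first_text(soup, [".kZ91ed", ".XcVN5d"]))  # falls through to the second selector
print(first_text(soup, [".does-not-exist"]))     # no match, so None
```

Trying several selectors in order is one way to cope with class names that change between page versions; the caller can then decide what to do when None comes back.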

Selenium will produce the result you need. It's convenient because you can add explicit waits and see what is actually happening on your screen.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


# Selenium 4 style: the driver path is passed through a Service object
driver = webdriver.Chrome(service=Service('/snap/bin/chromium.chromedriver'))

driver.get('https://google.com/')
assert "Google" in driver.title
wait = WebDriverWait(driver, 20)
# wait.until returns the located element once the condition holds
input_field = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".gLFyf.gsfi")))
input_field.send_keys("how many people in the world")
input_field.send_keys(Keys.RETURN)

result = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".Z0LcW.XcVN5d"))).text
print(result)
driver.quit()  # closes all windows and ends the driver session

The result will probably surprise you :)

You'll need to install Selenium and Chromedriver. On Windows, put the Chromedriver executable on your PATH; on Linux, pass the path to it explicitly. My example is for Linux.
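To check whether a Chromedriver executable can be found before starting Selenium, something like the following sketch may help. It relies only on shutil.which from the standard library; the helper name find_driver and the candidate names are assumptions to adjust for your own installation:

```python
import shutil


def find_driver(candidates):
    """Return the first candidate that resolves to an executable, else None."""
    for candidate in candidates:
        # shutil.which accepts a bare command name (searched on PATH) or a full path
        resolved = shutil.which(candidate)
        if resolved:
            return resolved
    return None


# Candidate names/paths are assumptions, not a definitive list
driver_path = find_driver(["chromedriver", "/snap/bin/chromium.chromedriver"])
print(driver_path or "chromedriver not found")
```

The returned path, if any, can then be used as the driver location when creating the Chrome driver.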

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please indicate the site URL or the original address.

 