I am trying to scrape Google search results that show the "Ad" label, i.e. to collect the Google ad links from a results page. I have the following script, but I am stuck at the soup.select() step: I am not sure which selectors to use. Any help is appreciated.
#! python3
#!usr/bin/env python3
import requests, bs4, webbrowser
#Get Google search results
ui_search = input("Search google: ")
print('Googling...') #display text while downloading
if len(ui_search)>1:
    res = requests.get('https://google.com/search?q=' + ' '.join(ui_search))
res.raise_for_status()
#Retrieve the results with ads and open them.
soup = bs4.BeautifulSoup(res.text, 'html.parser')
#Open a browser tab for each result
linkElems = soup.select('.V0MxL a')
linkElems2 = soup.select('.ad_cclk a')
numOpen = min(5, len(linkElems))
print(numOpen)
for i in range(numOpen):
    print(linkElems[i].get('href'))
    webbrowser.open('http://google.com' +linkElems[i].get('href'))
Similar code that does not target ads specifically:
#! python3
#lucky.py - Opens several Google search results.
import requests
import sys
import webbrowser
import bs4
ui_search = input("Search google: ")
print('Googling...') #display text while downloading
if len(sys.argv) > 1:
    res = requests.get('http://google.com/search?q=' + ' '.join(sys.argv[1:]))
elif len(ui_search) > 1:
    res = requests.get('http://google.com/search?q=' + ' '.join(ui_search))
res.raise_for_status()
#Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text, 'html.parser')
#type(soup)
#Open a browser tab for each result
linkElems = soup.select('.r a')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    print(linkElems[i])
    # webbrowser.open('http://google.com' + linkElems[i].get('href'))
For this specific case, I would rather use find_all() (also available as findAll()), because it lets me be more specific and tell bs4 to pick a tag with a specific class, inside which I can grab the ad link URLs. This works only if Google actually shows those ads while the script is running.
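To see how this kind of class match behaves without depending on a live results page, here is a minimal offline sketch. The HTML snippet is hypothetical (only the class string is taken from the full example in this answer), and the helper names ad_hrefs_find_all and ad_hrefs_select are made up for illustration. It relies on the bs4 behaviour that a multi-class string passed to class_ is compared against the whole class attribute, so the order of the classes matters, whereas a CSS selector matches the classes in any order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a results page. The class names are
# copied from this answer and may not match what Google serves today.
SAMPLE_HTML = """
<div class="RnJeZd top pla-unit-title"><a href="/aclk?ai=one">Card A</a></div>
<div class="top RnJeZd pla-unit-title"><a href="/aclk?ai=two">Card B</a></div>
<div class="RnJeZd top pla-unit-title">No link here</div>
<div class="organic"><a href="/url?q=example">Organic result</a></div>
"""


def ad_hrefs_find_all(soup):
    """Exact class-string match; blocks without an <a href> are skipped."""
    hrefs = []
    for block in soup.find_all("div", class_="RnJeZd top pla-unit-title"):
        anchor = block.find("a", href=True)
        if anchor is not None:
            hrefs.append(anchor["href"])
    return hrefs


def ad_hrefs_select(soup):
    """CSS selector match; the order of classes in the markup is irrelevant."""
    return [a["href"] for a in soup.select("div.RnJeZd.top.pla-unit-title a[href]")]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(ad_hrefs_find_all(soup))  # only the block whose class string matches exactly
print(ad_hrefs_select(soup))    # both blocks that carry all three classes
```

Checking for a missing anchor avoids a TypeError when a matching block contains no link, which can happen if the markup differs from what was seen in the inspector.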
Code and full example:
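from bs4 import BeautifulSoup
import requests
import lxml  # not used directly; the 'lxml' parser below needs it installed
# Send a browser-like User-Agent; without one Google may return different markup.
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q=graphic+card+buy&oq=graphic+card+buy&hl=en&gl=us&sourceid=chrome&ie=UTF-8', headers=headers).text
soup = BeautifulSoup(html, 'lxml')
# Match <div> elements whose class attribute is exactly this string (class names may change).
for link in soup.find_all('div', class_='RnJeZd top pla-unit-title'):
    ad_link = link.a['href']  # relative href of the ad, e.g. /aclk?...
    print(f'https://www.googleadservices.com/pagead{ad_link}')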
Output:
https://www.googleadservices.com/pagead/aclk?sa=l&ai=DChcSEwils83_1PrvAhUNjsgKHdWRC7sYABAFGgJxdQ&sig=AOD64_39ASmacGcHYwy9gGKmKFRuPLiOQg&ctype=5&q=&ved=2ahUKEwinrcf_1PrvAhWFKs0KHZzNCsMQww96BAgCED0&adurl=
https://www.googleadservices.com/pagead/aclk?sa=l&ai=DChcSEwils83_1PrvAhUNjsgKHdWRC7sYABADGgJxdQ&sig=AOD64_2rqOA3PxFKKsigRh1yy3z5QKbtcw&ctype=5&q=&ved=2ahUKEwinrcf_1PrvAhWFKs0KHZzNCsMQww96BAgCEEk&adurl=
https://www.googleadservices.com/pagead/aclk?sa=L&ai=DChcSEwils83_1PrvAhUNjsgKHdWRC7sYABAEGgJxdQ&sig=AOD64_0WuY3UDlgTziPk9nUw0f8s3zW3nA&ctype=5&q=&ved=2ahUKEwinrcf_1PrvAhWFKs0KHZzNCsMQww96BAgCEFU&adurl=
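The query string in the request above is hard-coded. If the search term comes from user input, one option is to build the URL with urllib.parse.urlencode from the standard library, which escapes spaces and reserved characters. The sketch below is only an assumption about how that could look: build_search_url is a hypothetical helper, and it keeps just the q, hl and gl parameters from the URL above.

```python
from urllib.parse import urlencode


def build_search_url(query, lang="en", country="us"):
    """Build a Google search URL with a properly encoded query string.

    build_search_url is an illustrative helper, not part of requests or bs4.
    """
    params = {"q": query, "hl": lang, "gl": country}
    # urlencode turns spaces into '+' and percent-encodes characters like '&'.
    return "https://www.google.com/search?" + urlencode(params)


print(build_search_url("graphic card buy"))
# https://www.google.com/search?q=graphic+card+buy&hl=en&gl=us
```

requests.get() also accepts a params dictionary and encodes it in a similar way, so passing params={"q": query} instead of concatenating strings is another route to the same result.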
Alternatively, you can use the Google Ad Results API from SerpApi. It's a paid API with a free trial. Check out the Playground to try it out.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "graphic card buy",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for ads in results["shopping_results"]:
    print(f"Ad link: {ads['link']}")
Part of JSON output:
"shopping_results": [
  {
    "position": 1,
    "block_position": "top",
    "title": "MSI GeForce GTX 1050 Ti DirectX 12 GTX 1050 Ti GAMING X 4G 4GB 128-Bit GDDR5 PCI Express 3.0 x16 HDCP Ready ATX Video Card",
    "price": "$378.96",
    "extracted_price": 378.96,
    "link": "https://www.google.com/aclk?sa=l&ai=DChcSEwiSrfjp0_rvAhUPwMgKHSkHDA8YABAFGgJxdQ&sig=AOD64_0JBh0DChYqoc9ZOWb2n74I16DHbQ&ctype=5&q=&ved=2ahUKEwjFuvDp0_rvAhVLX60KHS1kB9sQ5bgDegQIARA9&adurl=",
    "source": "Newegg.com",
    "rating": 4.7,
    "reviews": 1000,
    "thumbnail": "https://serpapi.com/searches/60754272d036a77fb9aba998/images/948de68641995317c7afd6bdf3a72f7a9d55cad671345a21921df988b8b9ef6c.png"
  }
]
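Since ads are not shown for every query, I would assume the "shopping_results" key can be absent from the response, in which case results["shopping_results"] raises a KeyError. A defensive sketch follows; the two dictionaries are hypothetical, trimmed-down stand-ins shaped like the JSON above, and extract_ad_links is an illustrative name, not part of the serpapi package.

```python
def extract_ad_links(results):
    """Collect 'link' values from 'shopping_results'; return [] if the key is absent."""
    links = []
    for item in results.get("shopping_results", []):
        link = item.get("link")
        if link:  # skip entries that carry no usable link
            links.append(link)
    return links


# Hypothetical responses modeled on the JSON excerpt above.
with_ads = {
    "shopping_results": [
        {"position": 1, "source": "Newegg.com",
         "link": "https://www.google.com/aclk?sa=l&adurl="},
        {"position": 2, "source": "Example store"},  # no "link" key
    ]
}
without_ads = {"organic_results": []}

for url in extract_ad_links(with_ads):
    print(f"Ad link: {url}")
print(extract_ad_links(without_ads))  # empty list instead of a KeyError
```

With the real client, the same helper could be called on the dictionary returned by search.get_dict().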
Disclaimer: I work for SerpApi.