
Web scraping for Google Ads using Python and BeautifulSoup

I am trying to scrape Google search results that show the "Ad" label on the right, i.e. to collect the Google ad links from a results page. I have the following script, but I am stuck at the soup.select() step: I am not sure which selectors to use. Any help is appreciated. (The original post included a screen capture of the inspected element.)

#! python3
#!usr/bin/env python3
import requests, bs4, webbrowser

# Get Google search results
ui_search = input("Search google: ")
print('Googling...')  # display text while downloading
if len(ui_search) > 1:
    res = requests.get('https://google.com/search?q=' + ' '.join(ui_search))
    res.raise_for_status()

# Retrieve the results with ads and open them.
soup = bs4.BeautifulSoup(res.text, 'html.parser')

# Open a browser tab for each result
linkElems = soup.select('.V0MxL a')
linkElems2 = soup.select('.ad_cclk a')
numOpen = min(5, len(linkElems))
print(numOpen)
for i in range(numOpen):
    print(linkElems[i].get('href'))
    webbrowser.open('http://google.com' + linkElems[i].get('href'))

Similar code that does not target ads specifically:

#! python3
# lucky.py - Opens several Google search results.
import requests
import sys
import webbrowser
import bs4

ui_search = input("Search google: ")
print('Googling...')  # display text while downloading
if len(sys.argv) > 1:
    res = requests.get('http://google.com/search?q=' + ' '.join(sys.argv[1:]))
elif len(ui_search) > 1:
    res = requests.get('http://google.com/search?q=' + ' '.join(ui_search))
res.raise_for_status()

# Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text, 'html.parser')
# type(soup)

# Open a browser tab for each result
linkElems = soup.select('.r a')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    print(linkElems[i])
    # webbrowser.open('http://google.com' + linkElems[i].get('href'))

Example results: (screenshot from the original post, not available here)

For this specific case, I would rather use the find_all() method (findAll() is its older alias), because it lets me be more specific: I can tell bs4 to pick a tag that has a particular class and then grab the ad link URLs from inside it.

This works only if Google actually shows those ads in the response at the time the script runs.
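To make the find_all() idea concrete without hitting Google, here is a minimal sketch that runs on a hand-written HTML snippet. The snippet, the helper name extract_ad_links, and the choice of matching on the single class pla-unit-title are assumptions for illustration; the class names Google actually serves can change at any time. When no ad block is present, the helper just returns an empty list.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written HTML standing in for a Google results page.
SAMPLE_HTML = """
<div class="RnJeZd top pla-unit-title"><a href="/aclk?sa=l&amp;ai=EXAMPLE1">Card A</a></div>
<div class="RnJeZd top pla-unit-title"><a href="/aclk?sa=l&amp;ai=EXAMPLE2">Card B</a></div>
<div class="r"><a href="/url?q=https://example.com">Organic result</a></div>
"""


def extract_ad_links(html, ad_class="pla-unit-title"):
    """Return the href of the first <a> inside every div carrying ad_class."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # class_ with a single name matches any div whose class list contains it.
    for div in soup.find_all("div", class_=ad_class):
        anchor = div.find("a", href=True)
        if anchor is not None:
            links.append(anchor["href"])
    return links


ad_links = extract_ad_links(SAMPLE_HTML)
# A page without ad blocks yields an empty list instead of an error.
no_ads = extract_ad_links("<div class='r'><a href='/x'>no ads here</a></div>")
print(ad_links)
print(no_ads)
```

Matching on one class name rather than the full class string is a design choice here: it keeps working if the other classes on the div are reordered or dropped.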

Code and full example:

from bs4 import BeautifulSoup
import requests
import lxml  # not used directly; importing it confirms the 'lxml' parser is installed

# Send a browser-like User-Agent so the response resembles what a browser receives
headers = {
  "User-Agent":
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=graphic+card+buy&oq=graphic+card+buy&hl=en&gl=us&sourceid=chrome&ie=UTF-8', headers=headers).text

soup = BeautifulSoup(html, 'lxml')

# Each matching div wraps an <a> whose href is the relative ad link
for link in soup.find_all('div', class_='RnJeZd top pla-unit-title'):
  ad_link = link.a['href']
  print(f'https://www.googleadservices.com/pagead{ad_link}')

Output:

https://www.googleadservices.com/pagead/aclk?sa=l&ai=DChcSEwils83_1PrvAhUNjsgKHdWRC7sYABAFGgJxdQ&sig=AOD64_39ASmacGcHYwy9gGKmKFRuPLiOQg&ctype=5&q=&ved=2ahUKEwinrcf_1PrvAhWFKs0KHZzNCsMQww96BAgCED0&adurl=
https://www.googleadservices.com/pagead/aclk?sa=l&ai=DChcSEwils83_1PrvAhUNjsgKHdWRC7sYABADGgJxdQ&sig=AOD64_2rqOA3PxFKKsigRh1yy3z5QKbtcw&ctype=5&q=&ved=2ahUKEwinrcf_1PrvAhWFKs0KHZzNCsMQww96BAgCEEk&adurl=
https://www.googleadservices.com/pagead/aclk?sa=L&ai=DChcSEwils83_1PrvAhUNjsgKHdWRC7sYABAEGgJxdQ&sig=AOD64_0WuY3UDlgTziPk9nUw0f8s3zW3nA&ctype=5&q=&ved=2ahUKEwinrcf_1PrvAhWFKs0KHZzNCsMQww96BAgCEFU&adurl=
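The search URL in the example above is written out by hand, and the script in the question builds it by joining the input with spaces. As a hedged alternative, the query string can be assembled with urllib.parse.urlencode from the standard library, which takes care of escaping. The helper name build_search_url is hypothetical; the q, hl and gl parameters are the ones that appear in the URL above.

```python
from urllib.parse import urlencode


def build_search_url(query, hl="en", gl="us"):
    """Build a Google search URL with the query safely URL-encoded."""
    params = {"q": query, "hl": hl, "gl": gl}
    # urlencode escapes reserved characters and turns spaces into '+'.
    return "https://www.google.com/search?" + urlencode(params)


url = build_search_url("graphic card buy")
print(url)
```

The resulting string can be passed to requests.get() together with the same headers dictionary used in the example.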

Alternatively, you can use the Google Ad Results API from SerpApi. It's a paid API with a free trial, and it has a Playground where you can try out queries.

Code to integrate:

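import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "graphic card buy",
  "api_key": os.getenv("API_KEY"),  # read the SerpApi key from the environment
}

search = GoogleSearch(params)
results = search.get_dict()

# "shopping_results" may be absent when no ads come back, so fall back to an empty list
for ad in results.get("shopping_results", []):
  print(f"Ad link: {ad['link']}")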

Part of JSON output:

"shopping_results": [
  {
    "position": 1,
    "block_position": "top",
    "title": "MSI GeForce GTX 1050 Ti DirectX 12 GTX 1050 Ti GAMING X 4G 4GB 128-Bit GDDR5 PCI Express 3.0 x16 HDCP Ready ATX Video Card",
    "price": "$378.96",
    "extracted_price": 378.96,
    "link": "https://www.google.com/aclk?sa=l&ai=DChcSEwiSrfjp0_rvAhUPwMgKHSkHDA8YABAFGgJxdQ&sig=AOD64_0JBh0DChYqoc9ZOWb2n74I16DHbQ&ctype=5&q=&ved=2ahUKEwjFuvDp0_rvAhVLX60KHS1kB9sQ5bgDegQIARA9&adurl=",
    "source": "Newegg.com",
    "rating": 4.7,
    "reviews": 1000,
    "thumbnail": "https://serpapi.com/searches/60754272d036a77fb9aba998/images/948de68641995317c7afd6bdf3a72f7a9d55cad671345a21921df988b8b9ef6c.png"
  }
]
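Once the response is a plain dictionary, pulling the ad links out of it needs no API call at all. The sketch below works on a hand-made dictionary shaped like the JSON excerpt above; the helper name ad_links_from_results and the placeholder link value are assumptions for illustration, not part of SerpApi's client.

```python
def ad_links_from_results(results):
    """Collect the 'link' field of every entry under 'shopping_results'."""
    items = results.get("shopping_results", [])  # missing key -> no ads
    return [item["link"] for item in items if "link" in item]


# Hand-made sample shaped like the JSON excerpt; the link is a placeholder.
sample_results = {
    "shopping_results": [
        {
            "position": 1,
            "block_position": "top",
            "source": "Newegg.com",
            "link": "https://www.google.com/aclk?sa=l&ai=EXAMPLE",
        }
    ]
}

links = ad_links_from_results(sample_results)
empty = ad_links_from_results({})
print(links)
print(empty)
```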

Disclaimer: I work for SerpApi.

