
Getting Google Search Result URLs from Search String or URL

So I want to find all the search results and store them in a list. Analysing the Google results page suggests that all results are technically contained in elements with the g class:

[Screenshot: analysis of the Google search results markup]

So technically, extracting a URL from the search results page should be as easy as:

import urllib.parse
from bs4 import BeautifulSoup
import requests

text = 'cyber security'
text = urllib.parse.quote_plus(text)

url = 'https://google.com/search?q=' + text

response = requests.get(url)

soup = BeautifulSoup(response.content, 'lxml')
for result_div in soup.find_all(class_='g'):
    links = result_div.find_all('a')
    hrefs = [link.get('href') for link in links]
    print(hrefs)

And yet, I have no output. Why?

Edit: Even manually parsing the stored page doesn't help:

import webbrowser

with open('output.html', 'wb') as f:
    f.write(response.content)
webbrowser.open('output.html')

with open('output.html') as page:
    soup = BeautifulSoup(page.read(), features="lxml")

for result_div in soup.find_all(class_='g'):
    links = result_div.find_all('a')
    hrefs = [link.get('href') for link in links]
    print(hrefs)

The following approach should fetch a few of the result links from Google's landing page. You may need to filter out some links ending with dots. Grabbing links from a Google search using requests alone is genuinely a difficult job.

import requests
from bs4 import BeautifulSoup

url = "http://www.google.com/search?q={}&hl=en"

def scrape_google_links(query):
    # a browser-like User-Agent is required, otherwise Google serves different markup
    res = requests.get(url.format(query.replace(" ", "+")), headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")
    # this selector targets the displayed breadcrumb text, not the raw href
    for result in soup.select(".kCrYT > a > .BNeawe:nth-of-type(2)"):
        print(result.text.replace(" › ", "/"))

if __name__ == '__main__':
    scrape_google_links('cyber security')
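
If you want the actual destination URLs rather than the displayed breadcrumb text, note that in this no-JavaScript markup Google usually wraps each result link as /url?q=<target>&sa=... . Here is a hedged variation of the function above that unwraps the href; the .kCrYT selector and the /url?q= wrapper are assumptions about Google's current markup and may change at any time:

import urllib.parse

import requests
from bs4 import BeautifulSoup

def scrape_google_hrefs(query):
    res = requests.get("http://www.google.com/search?q={}&hl=en".format(query.replace(" ", "+")),
                       headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")
    for a in soup.select(".kCrYT > a"):
        href = a.get("href", "")
        if href.startswith("/url?q="):
            # strip the /url?q= wrapper and URL-decode the real destination
            target = urllib.parse.unquote(href.split("/url?q=")[1].split("&sa=")[0])
            print(target)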

You can always climb several elements up or down to test things out using next_sibling / previous_sibling or next_element / previous_element. All results are in the <div> element with the .tF2Cxc class.
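
For example, a minimal navigation sketch (the HTML snippet here is a toy example, purely to illustrate the traversal calls):

from bs4 import BeautifulSoup

# toy markup, not real Google HTML
html = '<div><h3>Title</h3><span>Snippet</span></div>'
soup = BeautifulSoup(html, 'lxml')

h3 = soup.h3
print(h3.next_sibling)  # <span>Snippet</span> -- the adjacent tag
print(h3.next_element)  # 'Title' -- the string inside <h3> comes first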

Scraping the URLs is as easy as:

  1. making a for loop in combination with the bs4 .select() method, which takes CSS selectors as input.
  2. calling the .yuRUbf CSS selector with the .select_one() method.
  3. grabbing the <a> tag's href attribute.

for result in soup.select('.tF2Cxc'):
  link = result.select_one('.yuRUbf').a['href']

Code and example in the online IDE:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {'q': 'cyber security'}
html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# container with all needed data
for result in soup.select('.tF2Cxc'):
  link = result.select_one('.yuRUbf').a['href'] # or ('.yuRUbf a')['href']
  print(link)

# output:
'''
https://www.kaspersky.com/resource-center/definitions/what-is-cyber-security
https://www.cisco.com/c/en/us/products/security/what-is-cybersecurity.html
https://digitalguardian.com/blog/what-cyber-security
https://searchsecurity.techtarget.com/definition/cybersecurity
https://www.cisa.gov/cybersecurity
https://en.wikipedia.org/wiki/Computer_security
https://www.csoonline.com/article/3482001/what-is-cyber-security-types-careers-salary-and-certification.html
https://staysafeonline.org/
'''

Alternatively, you can do the same thing using Google Organic Results API from SerpApi. It's a paid API with a free plan.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
  "api_key": os.getenv("API_KEY"), # your API key, read from the environment
  "engine": "google", # search engine
  "q": "cyber security", # query
  "hl": "en", # defining a language
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
  link = result['link']
  print(link)

# output:
'''
https://www.kaspersky.com/resource-center/definitions/what-is-cyber-security
https://digitalguardian.com/blog/what-cyber-security
https://en.wikipedia.org/wiki/Computer_security
https://www.cisco.com/c/en/us/products/security/what-is-cybersecurity.html
https://staysafeonline.org/
https://searchsecurity.techtarget.com/definition/cybersecurity
https://www.cisa.gov/cybersecurity
https://www.csoonline.com/article/3482001/what-is-cyber-security-types-careers-salary-and-certification.html
'''

Disclaimer: I work for SerpApi.

from selenium import webdriver
from bs4 import BeautifulSoup
import time

browser = webdriver.Firefox()
dork = 'cyber security'
browser.get(f"https://www.google.com/search?q={dork}")
time.sleep(5)  # give the page time to finish loading
source = browser.page_source
browser.quit()
soup = BeautifulSoup(source, 'html.parser')

# note: 'r' is an older Google result-container class and may no longer match
for item in soup.find_all('div', attrs={'class': 'r'}):
    for href in item.find_all('a'):
        print(href.get('href'))
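
If you need to run this on a machine without a display, a headless variant should also work (a sketch assuming Selenium 4 and geckodriver on PATH):

from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument('--headless')  # run Firefox without a visible window
browser = webdriver.Firefox(options=options)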

Actually, if you print response.content and check the output, you will find that there is no HTML tag with class g. It seems that these elements come in via dynamic loading, while BeautifulSoup only parses the static content it is given. That is why looking for HTML tags with class g returns no elements.
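
A quick way to verify this yourself (a minimal sketch; exactly what markup Google returns depends on the request headers you send):

import requests

resp = requests.get('https://google.com/search', params={'q': 'cyber security'})
# with the default python-requests User-Agent, Google returns simplified HTML
print(resp.request.headers['User-Agent'])  # e.g. python-requests/2.x
print('class="g"' in resp.text)            # likely False for this response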
