So I want to find all the search results and store them in a list or something. Analysing the Google results page tells me that all results are technically inside elements with the g class.
So technically, extracting a URL from the search results page should be as easy as:
import urllib.parse
from bs4 import BeautifulSoup
import requests

text = 'cyber security'
text = urllib.parse.quote_plus(text)
url = 'https://google.com/search?q=' + text

response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')

for result_divs in soup.find_all(class_='g'):
    links = [div.find('a') for div in result_divs]
    hrefs = [link.get('href') for link in links]
    print(hrefs)
And yet, I have no output. Why?
Edit: Even manually parsing the stored page doesn't help:
import webbrowser

with open('output.html', 'wb') as f:
    f.write(response.content)
webbrowser.open('output.html')

url = "output.html"
page = open(url)
soup = BeautifulSoup(page.read(), features="lxml")
#soup = BeautifulSoup(response.content, 'lxml')

for result_divs in soup.find_all(class_='g'):
    links = [div.find('a') for div in result_divs]
    hrefs = [link.get('href') for link in links]
    print(hrefs)
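As a side note, independent of Google's markup: iterating over a bs4 Tag loops over its children (including bare text nodes), so the inner for div in result_divs loop above is probably not doing what was intended. A minimal sketch with hypothetical stand-in markup, showing the usual pattern of calling .find() on the matched Tag directly:

```python
from bs4 import BeautifulSoup

# hypothetical stand-in markup with the structure the code above expects
html = '<div class="g"><a href="https://example.com">Example</a></div>'
soup = BeautifulSoup(html, 'html.parser')

for result_div in soup.find_all(class_='g'):
    # result_div is already a Tag; call .find() on it directly
    link = result_div.find('a')
    print(link.get('href'))
```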
The following approach should fetch you a few of the result links from Google's landing page. You may need to filter out some links ending with dots. It is genuinely difficult to grab links from Google search using requests.
import requests
from bs4 import BeautifulSoup

url = "http://www.google.com/search?q={}&hl=en"

def scrape_google_links(query):
    res = requests.get(url.format(query.replace(" ", "+")), headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")
    for result in soup.select(".kCrYT > a > .BNeawe:nth-of-type(2)"):
        print(result.text.replace(" › ", "/"))

if __name__ == '__main__':
    scrape_google_links('cyber security')
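The "links ending with dots" mentioned above are truncated breadcrumb-style results; a simple stdlib filter can drop them. A sketch with hypothetical scraped values, not real output:

```python
# hypothetical scraped results; the entry ending in dots is a truncated breadcrumb
links = [
    'www.kaspersky.com/resource-center/definitions/what-is-cyber-...',
    'en.wikipedia.org/wiki/Computer_security',
    'www.cisa.gov/cybersecurity',
]

# keep only links that do not end with a dot
clean = [link for link in links if not link.endswith('.')]
print(clean)
```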
You can always climb several elements up or down to test things out using next_sibling / previous_sibling or next_element / previous_element. All results are in the <div> element with the .tF2Cxc class.
Scraping the URLs is as easy as a for loop combined with the bs4 .select() method, which takes CSS selectors as input: grab the .yuRUbf CSS selector with the .select_one() method, then read the href attribute of the <a> tag.

for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf').a['href']
Code and example in the online IDE:
import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {'q': 'cyber security'}

html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf').a['href']  # or ('.yuRUbf a')['href']
    print(link)
# output:
'''
https://www.kaspersky.com/resource-center/definitions/what-is-cyber-security
https://www.cisco.com/c/en/us/products/security/what-is-cybersecurity.html
https://digitalguardian.com/blog/what-cyber-security
https://searchsecurity.techtarget.com/definition/cybersecurity
https://www.cisa.gov/cybersecurity
https://en.wikipedia.org/wiki/Computer_security
https://www.csoonline.com/article/3482001/what-is-cyber-security-types-careers-salary-and-certification.html
https://staysafeonline.org/
'''
Alternatively, you can do the same thing using Google Organic Results API from SerpApi. It's a paid API with a free plan.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "api_key": os.getenv("API_KEY"),  # environment variable holding your API key
    "engine": "google",               # search engine
    "q": "cyber security",            # query
    "hl": "en",                       # defining a language
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    link = result['link']
    print(link)
# output:
'''
https://www.kaspersky.com/resource-center/definitions/what-is-cyber-security
https://digitalguardian.com/blog/what-cyber-security
https://en.wikipedia.org/wiki/Computer_security
https://www.cisco.com/c/en/us/products/security/what-is-cybersecurity.html
https://staysafeonline.org/
https://searchsecurity.techtarget.com/definition/cybersecurity
https://www.cisa.gov/cybersecurity
https://www.csoonline.com/article/3482001/what-is-cyber-security-types-careers-salary-and-certification.html
'''
Disclaimer, I work for SerpApi.
from selenium import webdriver
from bs4 import BeautifulSoup
import time

browser = webdriver.Firefox()
dork = 'cyber security'
browser.get(f"https://www.google.com/search?q={dork}")
time.sleep(5)

source = browser.page_source
soup = BeautifulSoup(source, 'html.parser')

for item in soup.findAll('div', attrs={'class': 'r'}):
    for href in item.findAll('a'):
        print(href.get('href'))
Actually, if you print response.content and check the output, you will find that there is no HTML tag with class g. These elements are added via dynamic loading, and BeautifulSoup only sees the static content that requests downloaded. That is why looking for HTML tags with class g doesn't return any elements.
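A quick way to confirm this for yourself is to scan the downloaded HTML for the class names it actually contains. A self-contained sketch, using a tiny inline stand-in for response.content (the real markup Google serves to requests will differ):

```python
import re

# stand-in for response.content fetched by requests
html = '<div class="ZINbbc"><div class="kCrYT"><a href="/url?q=example">result</a></div></div>'

# collect every class name that appears in class="..." attributes
classes = set()
for value in re.findall(r'class="([^"]+)"', html):
    classes.update(value.split())

print('g' in classes)   # the class the question's code looks for
print(sorted(classes))  # the classes that are actually present
```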