I'm trying to write a script that scrapes only the first link of a Google search, so that I can run a search from the terminal and look at the link later alongside the search term. I'm struggling to get only the first result. This is the closest I've got so far:
import requests
from bs4 import BeautifulSoup
research_later = "hiya"
goog_search = "https://www.google.co.uk/search?sclient=psy-ab&client=ubuntu&hs=k5b&channel=fs&biw=1366&bih=648&noj=1&q=" + research_later
r = requests.get(goog_search)
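soup = BeautifulSoup(r.text, "html.parser")
for link in soup.find_all('a'):
    # href can be missing (None), so format it rather than concatenating strings
    print("{} :{}".format(research_later, link.get('href')))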
It seems that Google uses the cite tag to display the link, so we can just use soup.find('cite').text, like this:
import requests
from bs4 import BeautifulSoup
research_later = "hiya"
goog_search = "https://www.google.co.uk/search?sclient=psy-ab&client=ubuntu&hs=k5b&channel=fs&biw=1366&bih=648&noj=1&q=" + research_later
r = requests.get(goog_search)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.find('cite').text)
Output is:
www.urbandictionary.com/define.php?term=hiya
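Both snippets above append the search term directly to the URL, which breaks as soon as the term contains a space or an `&`. A minimal sketch of building the query string with the standard library instead; `build_search_url` is a hypothetical helper name, and the long list of extra query parameters from the original URL is dropped here on the assumption that only `q` matters:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.google.co.uk/search"

def build_search_url(term):
    """Return a search URL with the term safely percent-encoded."""
    # urlencode escapes spaces, '&', '=' and other reserved characters
    return BASE_URL + "?" + urlencode({"q": term})

print(build_search_url("ice cream & cake"))
```

The resulting string can be passed to `requests.get()` exactly as `goog_search` is above.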
You can use either select_one() (which takes a CSS selector) or find() from bs4 to get only one element from the page. To find CSS selectors, you can use the SelectorGadget extension.
Code:
from bs4 import BeautifulSoup
import requests
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q=ice cream', headers=headers).text
soup = BeautifulSoup(html, 'lxml')
# locating div element with a tF2Cxc class
# calling for <a> tag and then calling for 'href' attribute
link = soup.find('div', class_='tF2Cxc').a['href']
print(link)
# output:
'''
https://en.wikipedia.org/wiki/Ice_cream
'''
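The answer mentions select_one() but the code uses find(). As a rough sketch, the same lookup with a CSS selector could look like the following. `SAMPLE_HTML` is a made-up, simplified fragment standing in for a real results page, `first_result_link` is a hypothetical helper, and the `tF2Cxc` class is taken from the code above (Google may change it at any time):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; a real Google results page is far more complex.
SAMPLE_HTML = """
<div class="tF2Cxc"><a href="https://en.wikipedia.org/wiki/Ice_cream">Ice cream</a></div>
<div class="tF2Cxc"><a href="https://example.com/second">Second result</a></div>
"""

def first_result_link(html):
    """Return the href of the first result link, or None if nothing matches."""
    soup = BeautifulSoup(html, "html.parser")
    # select_one returns the first element matching the CSS selector, or None
    anchor = soup.select_one("div.tF2Cxc a[href]")
    return anchor["href"] if anchor is not None else None

print(first_result_link(SAMPLE_HTML))
print(first_result_link("<p>no results</p>"))
```

Checking for None avoids the AttributeError that `soup.find(...).a['href']` raises when the page layout changes or the request is blocked.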
Alternatively, you can do the same thing with the Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.
The main difference is that everything (element selection, bypassing blocks, proxy rotation, and more) is already done for the end user, and the results come back as JSON.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "ice cream",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
# [0] - first index from the search results
link = results['organic_results'][0]['link']
print(link)
# output:
'''
https://en.wikipedia.org/wiki/Ice_cream
'''
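Indexing `results['organic_results'][0]` raises an error when a search returns no organic results. A small defensive sketch, assuming only the `organic_results` and `link` keys used above; `first_organic_link` and the sample dictionaries are hypothetical and stand in for the dictionary returned by `search.get_dict()`:

```python
def first_organic_link(results):
    """Return the first organic result's link, or None if there is none."""
    # Treat a missing key and an empty list the same way
    organic = results.get("organic_results") or []
    if not organic:
        return None
    return organic[0].get("link")

# Hypothetical response shapes, for illustration only
with_results = {"organic_results": [{"link": "https://en.wikipedia.org/wiki/Ice_cream"}]}
without_results = {"search_information": {}}

print(first_organic_link(with_results))
print(first_organic_link(without_results))
```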
Disclaimer: I work for SerpApi.