
How can I scrape the first link of a Google search with Beautiful Soup?

I'm trying to make a script that scrapes the first link of a Google search, so that it gives me back only that link and I can run a search from the terminal and look at the link later alongside the search term. I'm struggling to get only the first result. This is the closest I've got so far.

import requests
from bs4 import BeautifulSoup

research_later = "hiya"
goog_search = "https://www.google.co.uk/search?sclient=psy-ab&client=ubuntu&hs=k5b&channel=fs&biw=1366&bih=648&noj=1&q=" + research_later


r = requests.get(goog_search)
soup = BeautifulSoup(r.text, "html.parser")

for link in soup.find_all('a'):
    print(research_later + " : " + link.get('href'))
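For context, a minimal sketch of one way this could be narrowed down to a single result, assuming the classic no-JavaScript results page where result links take the form /url?q=&lt;target&gt;&... (Google's markup changes often, so treat this as illustrative only, not a guarantee):

import requests
from bs4 import BeautifulSoup
from urllib.parse import parse_qs, urlparse

research_later = "hiya"
goog_search = "https://www.google.co.uk/search?q=" + research_later

r = requests.get(goog_search)
soup = BeautifulSoup(r.text, "html.parser")

for link in soup.find_all('a', href=True):
    href = link['href']
    if href.startswith('/url?q='):
        # extract the real target URL from the "q" query parameter
        # and stop after the first match
        first_link = parse_qs(urlparse(href).query)['q'][0]
        print(research_later + " : " + first_link)
        break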

It seems Google uses the cite tag to hold the link, so we can just use soup.find('cite').text like this:

import requests
from bs4 import BeautifulSoup

research_later = "hiya"
goog_search = "https://www.google.co.uk/search?sclient=psy-ab&client=ubuntu&hs=k5b&channel=fs&biw=1366&bih=648&noj=1&q=" + research_later


r = requests.get(goog_search)

soup = BeautifulSoup(r.text, "html.parser")
print(soup.find('cite').text)

Output is:

www.urbandictionary.com/define.php?term=hiya

You can use either the select_one() method (which takes a CSS selector) or the find() method from bs4 to get a single element from the page. To work out the right CSS selectors, the SelectorGadget browser extension is handy; a select_one() variant is sketched after the example below.

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, json

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}


html = requests.get('https://www.google.com/search?q=ice cream', headers=headers).text
soup = BeautifulSoup(html, 'lxml')
# locating div element with a tF2Cxc class
# calling for <a> tag and then calling for 'href' attribute
link = soup.find('div', class_='tF2Cxc').a['href']
print(link)

# output:
'''
https://en.wikipedia.org/wiki/Ice_cream
'''
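For reference, a minimal select_one() sketch of the same idea. The .tF2Cxc class (and Google's result markup in general) is an assumption that changes frequently, so the selector may need updating:

from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=ice cream', headers=headers).text
soup = BeautifulSoup(html, 'lxml')

# select_one() returns the first element matching the CSS selector:
# here, the <a> inside the first result container (class name is an
# assumption and may change as Google updates its markup)
first = soup.select_one('.tF2Cxc a')
if first:
    print(first['href'])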

Alternatively, you can do the same thing by using Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.

The main difference is that everything (selecting elements, bypassing blocks, proxy rotation, and more) is already handled for the end user, with JSON output.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "ice cream",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

# [0] - first index from the search results
link = results['organic_results'][0]['link']
print(link)

# output:
'''
https://en.wikipedia.org/wiki/Ice_cream
'''

Disclaimer: I work for SerpApi.
