简体   繁体   中英

Pull Data/Links from Google Searches using Beautiful Soup

Evening Folks,

I'm attempting to ask Google a question, and pull all the relevant links from its respected search query (ie I search "site: Wikipedia.com Thomas Jefferson" and it gives me wiki.com/jeff, wiki.com/tom, etc.)

Here's my code:

from bs4 import BeautifulSoup
from urllib2 import urlopen

query = 'Thomas Jefferson'

query.replace (" ", "+")
#replaces whitespace with a plus sign for Google compatibility purposes

soup = BeautifulSoup(urlopen("https://www.google.com/?gws_rd=ssl#q=site:wikipedia.com+" + query), "html.parser")
#creates soup and opens URL for Google. Begins search with site:wikipedia.com so only wikipedia
#links show up. Uses html parser.

for item in soup.find_all('h3', attrs={'class' : 'r'}):
    print item.string
#Guides BS to h3 class "r" where green Wikipedia URLs are located, then prints URLs
#Limiter code to only pull top 5 results

The goal here is for me to set the query variable, have python query Google, and Beautiful Soup pulls all the "green" links, if you will.

Here is a picture of a Google results page

I only wish to pull the green links, in their full extent. What's weird is that Google's Source Code is "hidden" (a symptom of their search architecture), so Beautiful Soup can't just go and pull a href from an h3 tag. I am able to see the h3 hrefs when I Inspect Element, but not when I view source.

Here is a picture of the Inspect Element

My question is: How do I go about pulling the top 5 most relevant green links from Google via BeautifulSoup if I cannot access their Source Code, only Inspect Element?

PS: To give an idea of what I am trying to accomplish, I have found two relatively close Stack Overflow questions like mine:

beautiful soup extract a href from google search

How to collect data of Google Search with beautiful soup using python

I got a different URL than Rob M. when I tried searching with JavaScript disabled -

https://www.google.com/search?q=site:wikipedia.com+Thomas+Jefferson&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw

To make this work with any query, you should first make sure that your query has no spaces in it (that's why you'll get a 400: Bad Request). You can do this usingurllib.quote_plus() :

query = "Thomas Jefferson"
query = urllib.quote_plus(query)

which will urlencode all of the spaces as plus signs - creating a valid URL.

However , this does not work with urllib - you get a 403: Forbidden. I got it to work by using the python-requests module like this:

import requests
import urllib
from bs4 import BeautifulSoup

query = 'Thomas Jefferson'
query = urllib.quote_plus(query)

r = requests.get('https://www.google.com/search?q=site:wikipedia.com+{}&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw'.format(query))
soup = BeautifulSoup(r.text, "html.parser")
#creates soup and opens URL for Google. Begins search with site:wikipedia.com so only wikipedia
#links show up. Uses html parser.

links = []
for item in soup.find_all('h3', attrs={'class' : 'r'}):
    links.append(item.a['href'][7:]) # [7:] strips the /url?q= prefix
#Guides BS to h3 class "r" where green Wikipedia URLs are located, then prints URLs
#Limiter code to only pull top 5 results

Printing links gives:

print links
#  [u'http://en.wikipedia.com/wiki/Thomas_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggUMAA&usg=AFQjCNG6INz_xj_-p7mpoirb4UqyfGxdWA',
#   u'http://www.wikipedia.com/wiki/Jefferson%25E2%2580%2593Hemings_controversy&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggeMAE&usg=AFQjCNEjCPY-HCdfHoIa60s2DwBU1ffSPg',
#   u'http://en.wikipedia.com/wiki/Sally_Hemings&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggjMAI&usg=AFQjCNGxy4i7AFsup0yPzw9xQq-wD9mtCw',
#   u'http://en.wikipedia.com/wiki/Monticello&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggoMAM&usg=AFQjCNE4YlDpcIUqJRGghuSC43TkG-917g',
#   u'http://en.wikipedia.com/wiki/Thomas_Jefferson_University&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggtMAQ&usg=AFQjCNEDuLjZwImk1G1OnNEnRhtJMvr44g',
#   u'http://www.wikipedia.com/wiki/Jane_Randolph_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggyMAU&usg=AFQjCNHmXJMI0k4Bf6j3b7QdJffKk97tAw',
#   u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1800&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFgg3MAY&usg=AFQjCNEqsc9jDsDetf0reFep9L9CnlorBA',
#   u'http://en.wikipedia.com/wiki/Isaac_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFgg8MAc&usg=AFQjCNHKAAgylhRjxbxEva5IvDA_UnVrTQ',
#   u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1796&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFghBMAg&usg=AFQjCNHviErFQEKbDlcnDZrqmxGuiBG9XA',
#   u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1804&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFghGMAk&usg=AFQjCNEJZSxCuXE_Dzm_kw3U7hYkH7OtlQ']

You need to specify user-agent if you're getting empty results. It could be one of the reasons. I simplified the code a bit, removed the query variable that you have in your code.

Code andfull example to test:

from bs4 import BeautifulSoup
import requests
import lxml

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get(
    'https://www.google.com/search?q=site:wikipedia.com thomas edison',
    headers=headers).text

soup = BeautifulSoup(response, 'lxml')

for link in soup.find_all('div', class_='yuRUbf'):
    links = link.a['href']
    print(links)

Output:

https://en.wikipedia.com/wiki/Edison,_New_Jersey
https://en.wikipedia.com/wiki/Motion_Picture_Patents_Company
https://www.wikipedia.com/wiki/Thomas_E._Murray
https://en.wikipedia.com/wiki/Incandescent_light_bulb
https://en.wikipedia.com/wiki/Phonograph_cylinder
https://en.wikipedia.com/wiki/Emile_Berliner
https://wikipedia.com/wiki/Consolidated_Edison
https://www.wikipedia.com/wiki/hello
https://www.wikipedia.com/wiki/Tom%20Alston
https://en.wikipedia.com/wiki/Edison_screw

Alternatively, you can use Google Search Engine Results API from SerpApi:

Part JSON output:

{
 "position": 1,
 "title": "Thomas Edison - Wikipedia",
 "link": "https://en.wikipedia.org/wiki/Thomas_Edison",
 "displayed_link": "en.wikipedia.org › wiki › Thomas_Edison",
 "snippet": "Thomas Alva Edison (February 11, 1847 – October 18, 1931) was an American inventor and businessman who has been described as America's greatest ..."
}

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "site:wikipedia.com thomas edison",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
     print(f"Link: {result['link']}")

Output:

Link: https://en.wikipedia.com/wiki/Edison,_New_Jersey
Link: https://en.wikipedia.com/wiki/Motion_Picture_Patents_Company
Link: https://www.wikipedia.com/wiki/Thomas_E._Murray
Link: https://en.wikipedia.com/wiki/Incandescent_light_bulb
Link: https://en.wikipedia.com/wiki/Phonograph_cylinder
Link: https://en.wikipedia.com/wiki/Emile_Berliner
Link: https://wikipedia.com/wiki/Consolidated_Edison
Link: https://www.wikipedia.com/wiki/hello
Link: https://www.wikipedia.com/wiki/Tom%20Alston
Link: https://en.wikipedia.com/wiki/Edison_screw

Disclaimer, I work for SerpApi.

This isn't going to work with the hash search ( #q=site:wikipedia.com like you have it) as that loads the data in via AJAX rather than serving you the full parseable HTML with the results, you should use this instead:

soup = BeautifulSoup(urlopen("https://www.google.com/search?gbv=1&q=site:wikipedia.com+" + query), "html.parser")

For reference, I disabled javascript and performed a google search to get this url structure.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM