
Scrape Google with Python - What is the correct URL for requests.get?

Goal: I would like to check whether a specific Google search shows a suggested result on the right-hand side and, if it does, scrape some information such as company type, address, etc.

[Image: Google search results page with a suggestion on the right-hand side]

Approach: I wanted to use a Python scraper with Requests and BeautifulSoup4.

import bs4
import requests

address='https://www.google.co.ve/?gws_rd=cr&ei=DgBqVpWJMoPA-gHy25fACg#q=caracas+arepa'
page = requests.get(address)
soup = bs4.BeautifulSoup(page.content,'html.parser')
print(soup.prettify())

Issue:

The requested page does not include the search results (I am not sure whether some variable on the Google page is set to invisible); it contains only the header and footer of the Google page.

Questions:

  1. Alternative ways to obtain the described information? Any ideas?

  2. I did obtain results once with the described method, but the address was constructed differently (I remember many numbers in the Google URL, but unfortunately I cannot reproduce that search address). Therefore: does a Google URL have to meet some requirement so that it can be scraped via requests.get?

The best way to get information from a service like Google Places will almost always be the official API. That said, if you're dead set on scraping, it's likely that what the HTTP request returns is meant for a browser to render. BeautifulSoup only parses the markup it receives; it does not run the page's JavaScript, so it's very likely you're just getting empty containers that a browser would later fill in dynamically.
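On question 2, a quick check with the standard library suggests one reason the URL in the question returns only the Google home page. This is a minimal sketch, not a statement about how Google builds its URLs: it only splits the address from the question into its parts. The search term sits after the `#`, in the fragment, and HTTP clients do not send the fragment to the server, so the server is effectively asked for `/` with no query term.

```python
from urllib.parse import urlsplit

# The address from the question, unchanged.
address = "https://www.google.co.ve/?gws_rd=cr&ei=DgBqVpWJMoPA-gHy25fACg#q=caracas+arepa"

parts = urlsplit(address)

# Path and query are what the server receives.
print("path:    ", parts.path)
print("query:   ", parts.query)

# The fragment stays on the client; a browser's JavaScript may read it,
# but it is not part of the HTTP request.
print("fragment:", parts.fragment)
```

Putting the term in the query string instead (for example `/search?q=caracas+arepa`) at least gets it to the server, although the response may still differ from what a browser shows.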

I think your question is similar to google-search-with-python-reqeusts; you may find some help there.

I also agree with LiterallyElvis: using the API is a better idea than crawling the page directly.

Finally, if you still want to scrape the page yourself, I recommend using PhantomJS and Selenium to simulate a browser, since Google probably uses AJAX, which makes the page a real browser sees differ from what a crawler receives.
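A rough sketch of that browser-driven approach follows. PhantomJS is no longer maintained and recent Selenium releases dropped its driver, so this example swaps in headless Chrome instead; that substitution is mine, not the answer's. It assumes Selenium 4 and a local Chrome installation. The helper names `build_search_url` and `fetch_rendered_html` are made up for illustration.

```python
from urllib.parse import urlencode


def build_search_url(query):
    """Build a search URL with the term in the query string (not the fragment)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


def fetch_rendered_html(url):
    """Load the URL in headless Chrome and return the HTML after scripts ran."""
    # Imported lazily so build_search_url works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # assumes a recent Chrome build
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()


if __name__ == "__main__":
    print(build_search_url("caracas arepa"))
```

The returned HTML can then be passed to BeautifulSoup exactly as in the question. Google may still serve a consent page or a CAPTCHA to automated browsers, so this is not guaranteed to return results.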

Since I am in a country where Google is difficult to access, I could not reproduce your problem directly. The above is what I could think of; I hope it helps.

You need to use select_one() to pick the element (container) that holds all the needed data, check whether it exists, and, if so, scrape the data.

Make sure you send a user-agent header so the request looks like a "real" user visit; otherwise your request might be blocked, or you might receive different HTML with different selectors. Check what your user-agent is.
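To see what that means in practice, the sketch below prints the user-agent that requests sends by default and then prepares (without sending) a request carrying a browser-like header. Nothing is transmitted here; the browser string is simply the one used later in this answer, and any current browser string should serve the same purpose.

```python
import requests

# What requests announces itself as when no header is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # starts with "python-requests/"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"
}
params = {"q": "caracas arepa bar google", "gl": "us"}

# prepare() builds the final URL and headers but does not send anything.
prepared = requests.Request(
    "GET", "https://www.google.com/search", params=params, headers=headers
).prepare()

print(prepared.url)
print(prepared.headers["User-Agent"])
```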

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

params = {
    "q": "caracas arepa bar google",
    "gl": "us"
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"
}

html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

# if right side knowledge graph is present -> parse the data.
if soup.select_one(".liYKde"):
    place_name = soup.select_one(".PZPZlf.q8U8x span").text
    place_type = soup.select_one(".YhemCb+ .YhemCb").text
    place_reviews = soup.select_one(".hqzQac span").text
    place_rating = soup.select_one(".Aq14fc").text

    print(place_name, place_type, place_reviews, place_rating, sep="\n")

# output:
'''
Caracas Arepa Bar
Venezuelan restaurant
1,123 Google reviews
4.5
'''
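One caveat with the snippet above: select_one() returns None when a selector matches nothing, so calling .text on the result raises AttributeError if Google changes a class name or omits a field. Below is a more defensive sketch of the same "check the container, then read the fields" idea. It runs against hand-written sample markup rather than a live page; the class names are copied from the code above and may no longer match what Google serves, and `text_or_none` and `parse_panel` are hypothetical helper names.

```python
from bs4 import BeautifulSoup

# Hand-written sample, NOT real Google output. It deliberately omits the
# reviews element to show how a missing field is handled.
SAMPLE_HTML = """
<div class="liYKde">
  <div class="PZPZlf q8U8x"><span>Caracas Arepa Bar</span></div>
  <span class="Aq14fc">4.5</span>
</div>
"""


def text_or_none(node, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    found = node.select_one(selector)
    return found.get_text(strip=True) if found else None


def parse_panel(html):
    """Return a dict of panel fields, or None if the panel container is absent."""
    soup = BeautifulSoup(html, "html.parser")
    panel = soup.select_one(".liYKde")
    if panel is None:
        return None
    return {
        "name": text_or_none(panel, ".PZPZlf.q8U8x span"),
        "rating": text_or_none(panel, ".Aq14fc"),
        "reviews": text_or_none(panel, ".hqzQac span"),
    }


print(parse_panel(SAMPLE_HTML))
print(parse_panel("<html><body>no panel here</body></html>"))
```

A None result then cleanly answers the original "is there a suggestion on the right-hand side?" question, while a None field only means that one piece of information was not found.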

Alternatively, you can achieve the same thing using Google Knowledge Graph API from SerpApi. It's a paid API with a free plan.

The biggest difference is that you don't need to figure out how to parse the data, scale up the number of requests, or bypass blocks from Google and other search engines.

import json  # needed for json.dumps below

from serpapi import GoogleSearch

params = {
  "api_key": "YOUR_API_KEY",
  "engine": "google",
  "q": "caracas arepa bar place",
  "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

print(json.dumps([results["knowledge_graph"]], indent=2))


# part of the output:
'''
[
  {
    "title": "Caracas Arepa Bar",
    "type": "Venezuelan restaurant",
    "place_id": "ChIJVcQ2ll9ZwokRwmkvsArPXyo",
    "website": "http://caracasarepabar.com/",
    "description": "Arepa specialist offering creative, low-priced renditions of the Venezuelan corn-flour staple.",
    "local_map": {
      "image": "https://www.google.com/maps/vt/data=TF2Rd51PtEnU2M3pkZHYHKdSwhMDJ_ZwRfg0vfwlDRAmv1u919sgFl8hs_lo832ziTWxCZM9BKECs6Af-TA1hh0NLjuYAzOLFA1-RBEmj-8poygymcRX2KLNVTGGZZKDerZrKW6fnkONAM4Ui-BVN8XwFrwigoqqxObPg8bqFIgeM3LPCg",
      "link": "https://www.google.com/maps/place/Caracas+Arepa+Bar/@40.7131972,-73.9574167,15z/data=!4m2!3m1!1s0x0:0x2a5fcf0ab02f69c2?sa=X&hl=en",
      "gps_coordinates": {
        "latitude": 40.7131972,
        "longitude": -73.9574167,
        "altitude": 15
      }
    } ... much more results including place images, popular times, user reviews.
  }
]
'''
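Since the original goal was to verify whether a suggestion exists at all, it may be safer to read the "knowledge_graph" key with .get(), because indexing with results["knowledge_graph"] raises KeyError when the key is missing. The sketch below works on plain dictionaries shaped like the output above and makes no API call. It assumes the key is simply absent when there is no panel, and `summarize_knowledge_graph` is a hypothetical helper name.

```python
def summarize_knowledge_graph(results):
    """Return a few panel fields, or None when no knowledge graph is present."""
    kg = results.get("knowledge_graph")
    if not kg:
        return None
    # Keys taken from the sample output above; missing keys become None.
    return {key: kg.get(key) for key in ("title", "type", "website")}


# Trimmed copy of the sample output shown above.
with_panel = {
    "knowledge_graph": {
        "title": "Caracas Arepa Bar",
        "type": "Venezuelan restaurant",
        "website": "http://caracasarepabar.com/",
    }
}
without_panel = {"organic_results": []}  # assumed shape when no panel exists

print(summarize_knowledge_graph(with_panel))
print(summarize_knowledge_graph(without_panel))
```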

Disclaimer: I work for SerpApi.


 