Scrape Google with Python - What is the correct URL for requests.get?

Goal: I would like to verify whether a specific Google search shows a suggested result on the right-hand side and, in case of such a suggestion, scrape some information like company type, address, etc.

(Screenshot: Google search results page with a suggestion panel on the right-hand side)

Approach: I wanted to use a Python scraper with Requests and BeautifulSoup4:

import bs4
import requests

address = 'https://www.google.co.ve/?gws_rd=cr&ei=DgBqVpWJMoPA-gHy25fACg#q=caracas+arepa'
page = requests.get(address)
soup = bs4.BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

Issue:

The requested page does not include the search results (I am not sure if some variable on the Google page is set to invisible?); rather, it contains only the header and footer of the Google page.

Questions:

  1. Are there alternative ways to obtain the described information? Any ideas?

  2. I once obtained results with the described method, but the respective address was constructed differently (I remember many numbers in the Google URL, but unfortunately cannot reproduce the search address). Therefore: are there requirements on the Google URL so that it can be scraped via requests.get?
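One detail worth checking about the URL used in the snippet above: everything after `#` is a URL fragment, which the browser handles client-side and which requests never transmits to the server. That alone would explain why only the Google header and footer come back. A quick standard-library check:

```python
from urllib.parse import urlsplit

# The URL from the snippet above: note the search terms sit after '#'.
address = 'https://www.google.co.ve/?gws_rd=cr&ei=DgBqVpWJMoPA-gHy25fACg#q=caracas+arepa'
parts = urlsplit(address)

# The query string is what the server receives; the fragment never leaves the client.
print("query:   ", parts.query)     # gws_rd=cr&ei=DgBqVpWJMoPA-gHy25fACg
print("fragment:", parts.fragment)  # q=caracas+arepa
```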

The best way to get information from a service like Google Places will almost always be the official API. That said, if you're dead set on scraping, it's likely that what's returned by the HTTP request is meant for a browser to render. What BeautifulSoup does is not equivalent to rendering the data it receives, so it's very likely you're just getting useless empty containers that are later filled out dynamically.
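To make that concrete, here is a minimal sketch with hypothetical markup: BeautifulSoup parses the HTML exactly as delivered and never runs the script that would populate the container in a browser.

```python
from bs4 import BeautifulSoup

# Hypothetical response body: the container arrives empty, and a script
# (which only a browser would execute) fills it in later.
raw_html = """
<div id="results"></div>
<script>document.getElementById('results').innerHTML = '<p>filled later</p>';</script>
"""

soup = BeautifulSoup(raw_html, "html.parser")
container = soup.select_one("#results")

print(container is not None)                 # True: the container is in the HTML
print(container.get_text(strip=True) == "")  # True: but it holds no data
```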

I think your question is similar to google-search-with-python-reqeusts; maybe you could get some help from that.

And I agree with LiterallyElvis: the API is a better idea than crawling it directly.

Finally, if you want to use requests for this work, I recommend using PhantomJS and selenium to mock browser behavior, as Google likely uses some AJAX techniques that produce different views for a real browser and a crawler.

Since I'm in a country where Google is difficult to access, I couldn't reproduce your problem directly; the above is what I could think of. I hope it helps.

You need to select_one() the element (container) that contains all the needed data, check if it exists, and if so, scrape the data.

Make sure you're using a user-agent to act like a "real" user visit; otherwise your request might be blocked, or you might receive different HTML with different selectors. Check what your user-agent is.
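As a quick sanity check (a sketch that builds the request without sending it; the header value is just an example), you can inspect what requests would actually transmit:

```python
import requests

# Build (but do not send) the request to inspect the outgoing headers and URL.
prepared = requests.Request(
    "GET",
    "https://www.google.com/search",
    params={"q": "caracas arepa", "gl": "us"},
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
).prepare()

print(prepared.headers["User-Agent"])  # the UA string Google will see
print(prepared.url)                    # params are encoded into the query string
```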

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

params = {
    "q": "caracas arepa bar google",
    "gl": "us"
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"
}

html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

# if right side knowledge graph is present -> parse the data.
if soup.select_one(".liYKde"):
    place_name = soup.select_one(".PZPZlf.q8U8x span").text
    place_type = soup.select_one(".YhemCb+ .YhemCb").text
    place_reviews = soup.select_one(".hqzQac span").text
    place_rating = soup.select_one(".Aq14fc").text

    print(place_name, place_type, place_reviews, place_rating, sep="\n")

# output:
'''
Caracas Arepa Bar
Venezuelan restaurant
1,123 Google reviews
4.5
'''

Alternatively, you can achieve the same thing using the Google Knowledge Graph API from SerpApi. It's a paid API with a free plan.

The biggest difference is that you don't need to figure out how to parse the data, scale the number of requests, or bypass blocks from Google and other search engines.

import json
from serpapi import GoogleSearch

params = {
  "api_key": "YOUR_API_KEY",
  "engine": "google",
  "q": "caracas arepa bar place",
  "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

print(json.dumps([results["knowledge_graph"]], indent=2))


# part of the output:
'''
[
  {
    "title": "Caracas Arepa Bar",
    "type": "Venezuelan restaurant",
    "place_id": "ChIJVcQ2ll9ZwokRwmkvsArPXyo",
    "website": "http://caracasarepabar.com/",
    "description": "Arepa specialist offering creative, low-priced renditions of the Venezuelan corn-flour staple.",
    "local_map": {
      "image": "https://www.google.com/maps/vt/data=TF2Rd51PtEnU2M3pkZHYHKdSwhMDJ_ZwRfg0vfwlDRAmv1u919sgFl8hs_lo832ziTWxCZM9BKECs6Af-TA1hh0NLjuYAzOLFA1-RBEmj-8poygymcRX2KLNVTGGZZKDerZrKW6fnkONAM4Ui-BVN8XwFrwigoqqxObPg8bqFIgeM3LPCg",
      "link": "https://www.google.com/maps/place/Caracas+Arepa+Bar/@40.7131972,-73.9574167,15z/data=!4m2!3m1!1s0x0:0x2a5fcf0ab02f69c2?sa=X&hl=en",
      "gps_coordinates": {
        "latitude": 40.7131972,
        "longitude": -73.9574167,
        "altitude": 15
      }
    } ... much more results including place images, popular times, user reviews.
  }
]
'''

Disclaimer: I work for SerpApi.
