
REST: Glassdoor API requires User-Agent in header

This is related to this question. I was trying to query the Glassdoor public API using the documented parameters, but kept getting a 403 Forbidden response. To make sure that the query URL was being composed correctly from the parameters, I pasted the composed URL into my browser, and it worked.

Working backwards from the request that my browser was making, I figured out that the user agent needs to be passed not only as a parameter in the URL but also in the request header.

So putting this all together, here is code that queries the Glassdoor public API successfully:

import requests

# the same user agent string is sent twice: as a URL parameter and as a header
user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"

# authentication information & other request parameters
# ("t.p", "t.k" and "employerID" are placeholders; substitute your own values)
params_gd = {
    "v": "1",
    "format": "json",
    "t.p": "xxxxxx",
    "t.k": "yyyyyyyy",
    "action": "employers",
    "employerID": "11111",
    # programmatically get the public IP of the machine
    "userip": requests.get("http://ip.jsontest.com/").json()["ip"],
    "useragent": user_agent,
}

# base URL of the API; requests appends the parameters as the query string
basepath_gd = 'http://api.glassdoor.com/api/api.htm'

# request the API, passing the user agent in the header as well
response_gd = requests.get(basepath_gd,
                           params=params_gd,
                           headers={"User-Agent": user_agent})

# check the response code (should be 200) & the content
print(response_gd.status_code)
print(response_gd.content)
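
To see that the "useragent" query parameter and the User-Agent header are two separate things, the request can be built without sending it. The sketch below uses only the standard library; `build_request` is a hypothetical helper written for illustration, and the user agent string is a shortened placeholder.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import Request

BASE_URL = "http://api.glassdoor.com/api/api.htm"
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"  # shortened placeholder


def build_request(params, send_header):
    """Build (but do not send) a GET request; hypothetical helper."""
    url = BASE_URL + "?" + urlencode(params)
    headers = {"User-Agent": USER_AGENT} if send_header else {}
    return Request(url, headers=headers)


params = {"v": "1", "format": "json", "useragent": USER_AGENT}

with_header = build_request(params, send_header=True)
without_header = build_request(params, send_header=False)

# The query parameter is part of the URL in both cases ...
query = parse_qs(urlsplit(with_header.full_url).query)
print(query["useragent"])

# ... but only one of the two requests carries the explicit header.
# (urllib stores header names capitalized as "User-agent".)
print(with_header.has_header("User-agent"))
print(without_header.has_header("User-agent"))
```

Sending both variants and comparing the status codes is one way to check which of the two values the server actually inspects.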

My question is: why does the User-Agent need to be specified in the request header when it is already part of the URL parameters? Shouldn't the query work without the user agent header?

fg,

Some providers don't like serving data to automated tools that may simply be scraping their data... one of the ways they "can tell" that they're serving a "person" and not some sort of whacky Python script is by checking the User-Agent header normally applied by the browser.
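
As an illustration of why a script is easy to spot: Python HTTP clients announce themselves in the User-Agent header unless told otherwise (requests, for example, sends "python-requests/<version>" by default). A minimal standard-library sketch follows; the browser string is a shortened placeholder, and whether Glassdoor actually filters on this value is an assumption, not something their documentation confirms.

```python
import urllib.request

# A fresh opener carries urllib's default headers.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)
print(default_headers["User-agent"])  # "Python-urllib/" followed by the Python version

# Overriding the header on a single request (nothing is sent here).
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)"  # shortened placeholder
req = urllib.request.Request(
    "http://api.glassdoor.com/api/api.htm",
    headers={"User-Agent": BROWSER_UA},
)
print(req.get_header("User-agent"))
```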

In this specific instance, Glassdoor has published their API Terms here, and at the top of page three they state: "We reserve the right to limit or block applications that make a large number of calls to the Glassdoor API that are not primarily in response to the direct actions of individual end users."

I'm inclined to think that this is enforced by looking at the User-Agent header, but most companies will not explicitly state how they enforce this. They also require that you display their logo and link to their home page on the approved webpage/site on which you display their data.

Hope this helps.
