
URL forbidden 403 when using a tool but fine from browser

I have some images for which I need to issue a HEAD request (HttpRequestMethod.HEAD) to find out some details about each image.

When I go to the image URL in a browser, it loads without a problem.

When I attempt to get the header info via my code or via online tools, it fails.

An example URL is http://www.adorama.com/images/large/CHHB74P.JPG

As mentioned, I have used the online tool Hurl.it to try the HEAD request, but I get the same 403 Forbidden message that I get in my code. I have tried adding various headers to the HEAD request (User-Agent, Accept, Accept-Encoding, Accept-Language, Cache-Control, Connection, Host, Pragma, Upgrade-Insecure-Requests), but none of this seems to work.

A normal GET request via Hurl.it also fails with the same 403 error.

If it is relevant, my code is a C# web service running on the AWS cloud (just in case the Adorama servers have something against AWS that I don't know about). To test this I also spun up an EC2 instance (a Linux box) and ran curl, which also returned the 403 error. Running curl locally on my personal computer returns binary output, which is presumably just the image data.

And just to rule out the obvious, my code works successfully for many other websites; it is only this one that has an issue.

Any idea what is required for me to download the image headers and not get the 403?

Same problem here.

Locally it works smoothly. Doing it from an AWS instance, I get the very same problem.

I thought it was a DNS resolution problem (redirecting to a malfunctioning node). I therefore tried specifying the same IP address that my local client resolved, but that didn't fix the problem.

My guess is that Akamai (the service is provided by an Akamai CDN in this case) is blocking AWS. That is somewhat understandable: customers pay for CDN traffic, so abusive traffic can generate huge bills.

Connecting to www.adorama.com (www.adorama.com)|104.86.164.205|:80... connected.

HTTP request sent, awaiting response... 
HTTP/1.1 403 Forbidden
Server: **AkamaiGHost**
Mime-Version: 1.0
Content-Type: text/html
Content-Length: 301
Cache-Control: max-age=604800
Date: Wed, 23 Mar 2016 09:34:20 GMT
Connection: close
2016-03-23 09:34:20 ERROR 403: Forbidden.

I tried that URL from Amazon and it didn't work for me. However, wget did work from other servers that weren't on Amazon EC2. Here is the wget output on EC2:

wget -S http://www.adorama.com/images/large/CHHB74P.JPG
--2016-03-23 08:42:33--  http://www.adorama.com/images/large/CHHB74P.JPG
Resolving www.adorama.com... 23.40.219.79
Connecting to www.adorama.com|23.40.219.79|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.0 403 Forbidden
  Server: AkamaiGHost
  Mime-Version: 1.0
  Content-Type: text/html
  Content-Length: 299
  Cache-Control: max-age=604800
  Date: Wed, 23 Mar 2016 08:42:33 GMT
  Connection: close
2016-03-23 08:42:33 ERROR 403: Forbidden.

But from another Linux host it did work. Here is the output:

wget -S http://www.adorama.com/images/large/CHHB74P.JPG
--2016-03-23 08:43:11--  http://www.adorama.com/images/large/CHHB74P.JPG
Resolving www.adorama.com... 23.45.139.71
Connecting to www.adorama.com|23.45.139.71|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.0 200 OK
  Content-Type: image/jpeg
  Last-Modified: Wed, 23 Mar 2016 08:41:57 GMT
  Server: Microsoft-IIS/8.5
  X-AspNet-Version: 2.0.50727
  X-Powered-By: ASP.NET
  ServerID: C01
  Content-Length: 15131
  Cache-Control: private, max-age=604800
  Date: Wed, 23 Mar 2016 08:43:11 GMT
  Connection: keep-alive
  Set-Cookie: 1YDT=CT; expires=Wed, 20-Apr-2016 08:43:11 GMT; path=/; domain=.adorama.com
  P3P: CP="NON DSP ADM DEV PSD OUR IND STP PHY PRE NAV UNI"
Length: 15131 (15K) [image/jpeg]
Saving to: “CHHB74P.JPG”

100%[=====================================>] 15,131      --.-K/s   in 0s      

2016-03-23 08:43:11 (460 MB/s) - “CHHB74P.JPG” saved [15131/15131]

I would guess that the image provider is deliberately blocking requests from EC2 address ranges.
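One way to check this guess from your side is to see whether your server's public IP falls inside a known set of cloud address blocks. Below is a minimal sketch of that membership test in Python. The function name and the CIDR blocks are placeholders (the blocks come from the reserved documentation ranges, not from real AWS data); in practice you would load the blocks from the address-range list that AWS publishes.

```python
import ipaddress


def ip_in_any_range(ip, cidr_blocks):
    """Return the first CIDR block that contains ip, or None if none does."""
    address = ipaddress.ip_address(ip)
    for block in cidr_blocks:
        network = ipaddress.ip_network(block)
        # Only compare addresses and networks of the same IP version.
        if address.version == network.version and address in network:
            return block
    return None


# Placeholder blocks taken from the reserved documentation ranges,
# NOT real AWS ranges. Replace them with the published list.
EXAMPLE_BLOCKS = ["198.51.100.0/24", "203.0.113.0/24"]

print(ip_in_any_range("203.0.113.45", EXAMPLE_BLOCKS))
print(ip_in_any_range("192.0.2.10", EXAMPLE_BLOCKS))
```

If your instance's address matches a cloud block and the same request succeeds from a non-cloud host, that supports the IP-range explanation, although it does not prove it.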

The IP address that wget connects to differs between the two examples because of DNS resolution by the CDN provider that Adorama uses.
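To compare runs like the two above side by side, a small parser can pull the resolved edge IP, the status code, and the Server header out of `wget -S` output. This is only a sketch: the function name is made up for illustration, and the sample logs are shortened excerpts of the transcripts in this thread.

```python
import re


def summarize_wget_log(log_text):
    """Extract the resolved IP, HTTP status and Server header from `wget -S` output."""
    resolved = re.search(r"Resolving \S+ (\d+\.\d+\.\d+\.\d+)", log_text)
    status = re.search(r"HTTP/\d\.\d (\d{3})", log_text)
    server = re.search(r"^\s*Server:\s*(.+)$", log_text, re.MULTILINE)
    return {
        "ip": resolved.group(1) if resolved else None,
        "status": int(status.group(1)) if status else None,
        "server": server.group(1).strip() if server else None,
    }


# Shortened excerpts of the two transcripts quoted in this thread.
EC2_LOG = """Resolving www.adorama.com... 23.40.219.79
  HTTP/1.0 403 Forbidden
  Server: AkamaiGHost
"""
OTHER_LOG = """Resolving www.adorama.com... 23.45.139.71
  HTTP/1.0 200 OK
  Content-Type: image/jpeg
  Server: Microsoft-IIS/8.5
"""

for name, log in (("EC2", EC2_LOG), ("other host", OTHER_LOG)):
    print(name, summarize_wget_log(log))
```

In the blocked case the Server header is AkamaiGHost, so the CDN edge answered directly; in the successful case the headers come from the origin (Microsoft-IIS/8.5).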

A web server may check particular fingerprint attributes to keep automated bots out. Here are a few of the things it can check:

  • GeoIP, IP address
  • Browser headers
  • User agents
  • Plugin info
  • Fonts reported by the browser

You can simulate the browser headers and learn about some fingerprinting "attributes" here: https://panopticlick.eff.org

You can try to replicate how a browser behaves and inject similar headers and a similar user agent. Plain curl/wget are not likely to satisfy those conditions; even tools like PhantomJS occasionally get blocked. There is a reason why some prefer tools like Selenium WebDriver that launch an actual browser.
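As a rough illustration of injecting browser-like headers, here is a minimal Python sketch of a HEAD request using only the standard library. The header values and function names are assumptions chosen for the example, not something the site is known to require. If the block is based on the client's IP range, as other answers here suspect, no set of headers will help.

```python
import urllib.error
import urllib.request

# Assumed browser-like values, for illustration only.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0",
    "Accept": "image/webp,*/*",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_head_request(url, headers=BROWSER_HEADERS):
    """Build (but do not send) a HEAD request carrying browser-like headers."""
    return urllib.request.Request(url, headers=dict(headers), method="HEAD")


def fetch_head(url, timeout=10):
    """Send the HEAD request and return (status, headers), including for 4xx/5xx."""
    request = build_head_request(url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, dict(response.headers.items())
    except urllib.error.HTTPError as error:
        # A 403 arrives here as an exception; its headers are still useful.
        return error.code, dict(error.headers.items())


# Example call (performs a real network request):
# status, headers = fetch_head("http://www.adorama.com/images/large/CHHB74P.JPG")
```

Catching `HTTPError` matters here because the response headers of the 403 (for example `Server: AkamaiGHost`) show which layer rejected the request.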

I found that another URL, also protected by AkamaiGHost, was being blocked because of certain parts of the user agent. In particular, a user agent containing a link with a protocol was blocked:

Using curl -H 'User-Agent: some-user-agent' https://some.website I found the following results for different user agents:

  • Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0 : okay
  • facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php) : 403
  • https ://bar : okay
  • https://bar : 403

All I could find for now is this (downvoted) answer, https://stackoverflow.com/a/48137940/230422, stating that colons ( : ) are not allowed in header values. That is clearly not the only thing happening here, as the Mozilla example also contains a colon, just not a link.

I guess that most web servers don't care and allow Facebook's bot and other bots that have a contact URL in their user agent. But apparently AkamaiGHost does block it.
