Perl: Some websites block non-browser requests. But how?

I'm writing a simple Perl script that fetches some pages from different sites. It's very non-intrusive: I don't hog a server's bandwidth. It retrieves a single page without loading any extra JavaScript, images, or style sheets.

I use LWP::UserAgent to retrieve the pages. This works fine on most sites, but some sites return a "403 - Bad Request" error. The same pages load perfectly fine in my browser. I have inspected the request headers in my web browser and copied them exactly when trying to retrieve the same page in Perl, and every single time I get a 403 error. Here's a code snippet:

use strict;
use warnings;

use LWP::UserAgent;
use HTTP::Cookies;

my $URL = "https://www.betsson.com/en/casino/jackpots";

my $browserObj = LWP::UserAgent->new(
  ssl_opts => { verify_hostname => 0 }
);

# $browserObj->cookie_jar( {} );
my $cookie_jar = HTTP::Cookies->new();
$browserObj->cookie_jar( $cookie_jar );

$browserObj->agent( "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0");
$browserObj->timeout(600);
push @{ $browserObj->requests_redirectable }, 'POST';

my @header = ( 'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
               'Accept-Encoding' => 'gzip, deflate, br',
               'Accept-Language' => 'en-US,en;q=0.5',
               'Connection' => 'keep-alive',
               'DNT' => '1',
               'Host' => 'www.bettson.com',
               'Upgrade-Insecure-Requests' => '1'
             );

my $response = $browserObj->get( $URL, @header );
if( $response->is_success ) {
  print "Success!\n";
} else {
  print "Unsuccessful: ", $response->status_line, "\n";
}

How do these servers distinguish between a real browser and my script? At first I thought they had some JavaScript trickery going on, but then I realized that for that to work, the page has to be loaded by a browser first. Yet I immediately get this 403 error.

What can I do to debug this?
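For reference, one way to see what LWP actually sends and receives is its handler hooks; the sketch below (using the URL from the question) dumps each outgoing request as prepared by LWP and each response's status line. Note that some headers, such as Host, may only be added later by the protocol layer, so the dump shows the request as LWP has prepared it so far.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Print every outgoing request as LWP has prepared it so far.
$ua->add_handler( request_send => sub {
    my ($request) = @_;
    print $request->as_string;
    return;    # returning undef lets the request proceed normally
});

# Print the status line of every response received.
$ua->add_handler( response_done => sub {
    my ($response) = @_;
    print $response->status_line, "\n";
    return;
});

my $response = $ua->get("https://www.betsson.com/en/casino/jackpots");
```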

While 403 is a typical answer for bot detection, in this case bot detection is not the cause of the problem. Instead, the cause is a typo in your code:

my $URL = "https://www.betsson.com/en/casino/jackpots";
...
               'Host' => 'www.bettson.com',

In the URL the domain name is www.betsson.com, and this should be reflected in the Host header. But your Host header is slightly different: www.bettson.com. Since the Host header has the wrong name, the request is rejected with 403 Forbidden.
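One way to avoid this class of typo is to never hard-code the Host header at all: LWP fills it in from the URL automatically. If you do need the host name elsewhere, you can derive it from the URL with the URI module; a minimal sketch:

```perl
use strict;
use warnings;
use URI;

# The Host header must match the host part of the URL, so derive it
# from the URL instead of hard-coding it separately.
my $url  = URI->new("https://www.betsson.com/en/casino/jackpots");
my $host = $url->host;

print "$host\n";    # www.betsson.com
```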

And actually, it is not even necessary to go through all this trouble, since it looks like no bot detection is done at all. I.e. there is no need to set the user agent and fiddle with the headers; a plain request works:

my $browserObj = LWP::UserAgent->new();
my $response = $browserObj->get($URL);
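For completeness, here is the same minimal fetch with basic error reporting; `decoded_content` and `status_line` are standard HTTP::Response methods, and `decoded_content` transparently undoes any gzip/deflate content encoding:

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $URL = "https://www.betsson.com/en/casino/jackpots";

my $browserObj = LWP::UserAgent->new;
my $response   = $browserObj->get($URL);

if ( $response->is_success ) {
    # The page body, with any content encoding already undone.
    print $response->decoded_content;
}
else {
    # status_line gives the code and reason, e.g. "403 Forbidden".
    die "Request failed: ", $response->status_line, "\n";
}
```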

Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0; if you repost, please credit this site or the original source. For any questions, contact yoyou2525@163.com.
