
How can I make LWP::UserAgent look like another browser?

This is my first post on SO, so be gentle. I'm not even sure if this belongs here, but here goes.

I want to access some information on one of my personal accounts. The website is poorly written and requires me to manually input the date I want the information for. It is truly a pain. I have been looking for an excuse to learn more Perl so I thought this would be a great opportunity. My plan was to write a Perl script that would login to my account and query the information for me. However, I got stuck pretty quickly.

use LWP::UserAgent;
use URI::URL;                       # exports the url() function
use HTTP::Request::Common qw(GET);  # exports the GET() function
my $ua  = LWP::UserAgent->new;
my $url = url 'https://account.web.site';
my $res = $ua->request(GET $url);

The resulting web page basically says that my web browser is not supported. I tried a number of different values for

$ua->agent("");

but nothing seems to work. Googling around suggests this method, but it also says that Perl is used for malicious purposes on web sites. Do web sites block this method? Is what I am trying to do even possible? Is there a different language that would be more appropriate? Is what I'm trying to do even legal, or even a good idea? Maybe I should just abandon my efforts.

Note that to prevent giving away any private information, the code I wrote here is not the exact code I am using. I hope that was pretty obvious, though.

EDIT: In Firefox, I disabled JavaScript and CSS. I logged in just fine without the "Incompatible browser" error. It doesn't seem to be a JavaScript issue.

Getting a different webpage with scraping

We have to make one assumption: the web server will return the same output if given the same input. With this assumption we inescapably come to the conclusion that we're not giving it the same input. There are two browsers, or HTTP clients, in this scenario: the one that is giving you the result you want (e.g., Firefox, IE, Chrome, or Safari), and the one that is not (e.g., LWP, wget, or cURL).

Kill off the easy possibilities first

Before continuing, first make sure the simple things like the User-Agent strings are the same; you can do this by browsing to whatsmyuseragent.com and setting the User-Agent string in the header of the other browser to whatever that website returns. You can also use Firefox's Web Developer's Toolbar to disable CSS, JavaScript, Java, and meta-redirects: this will help you track down the problem by killing off the really simple stuff.
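For example, a minimal sketch (the User-Agent string below is just a placeholder; paste in whatever whatsmyuseragent.com reported for your working browser, and the URL is the question's stand-in):

use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
# Placeholder UA string: substitute the exact string your working browser sends.
$ua->agent('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0');
my $res = $ua->get('https://account.web.site');
print $res->status_line, "\n";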

Now attempt to duplicate the working browser

Now with Firefox you can use FireBug to analyze the REQUEST that is sent. You can do this under the NET tab in FireBug; other browsers have their own tools that do what FireBug does with Firefox. However, if you don't know the tool in question, you can still use tshark or wireshark as described below. It is important to note that tshark and wireshark will always be more accurate, because they work at a lower level, which at least in my experience leaves less room for error. For example, you'll see things like meta-redirects the browser is doing, which FireBug can sometimes lose track of.

After you understand the first web request, the one that works, do your best to make the second web request match it. By this I mean setting the request headers properly, along with the other request elements. If this still doesn't work, you have to know what the second browser is doing to see what is wrong.
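As a rough sketch of what setting those headers might look like (every header value here is a placeholder; copy the real ones verbatim from the captured working request):

use LWP::UserAgent;
use HTTP::Request;

my $ua  = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => 'https://account.web.site');

# Copy each header from the request the working browser sent.
$req->header('User-Agent'      => 'Mozilla/5.0 (placeholder)');
$req->header('Accept'          => 'text/html,application/xhtml+xml,*/*;q=0.8');
$req->header('Accept-Language' => 'en-US,en;q=0.5');

my $res = $ua->request($req);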

Troubleshooting

In order to troubleshoot this, we must have a total understanding of the requests from both browsers. The second browser is usually trickier; these are often libraries and non-interactive command-line browsers that lack the ability to check the request. If they have the ability to dump the request, you might still opt to simply check it anyway. To do this I suggest the wireshark and tshark suite. Be warned up front: because these operate below the browser, you'll see the actual network (IP) packets and data-link frames by default. You can filter out what you need specifically with a command like this:

sudo tshark -i <interface> -f tcp -R "http.request" -V |
perl -ne'print if /^Hypertext/../^Frame/'

This will capture all of the TCP packets, display-filter only the http.request s, then use the perl one-liner to keep only the application-layer HTTP stuff. You might want to extend the display filter to grab only a single web server, too: -R "http.request and http.host == ''"

You're going to want to check everything to see if the two requests are in line: cookies, GET URL, user-agent, etc. Make sure the site doesn't do something goofy.

Updated Jan 23 2010: Based on the new information, I would suggest setting Accept, Accept-Language, Accept-Charset and Accept-Encoding. You can do that through $ua->default_headers(). If you demand a lot more functionality out of your user agent, you can always subclass it. I took this approach for my GData API; you can find my example of a UserAgent subclass on github.
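A minimal sketch of that suggestion (the header values below are typical browser defaults, not values the site is known to require):

use LWP::UserAgent;
use HTTP::Headers;

my $ua = LWP::UserAgent->new;
# Send the Accept* headers a real browser would send with every request.
$ua->default_headers(HTTP::Headers->new(
    'Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language' => 'en-us,en;q=0.5',
    'Accept-Charset'  => 'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
    'Accept-Encoding' => 'gzip, deflate',
));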

You should probably look at WWW::Mechanize, which is a subclass of LWP::UserAgent oriented towards that sort of website automation. In particular, see the agent_alias method.
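For example, a minimal sketch ('Windows Mozilla' is one of the aliases documented by WWW::Mechanize; the URL is the question's placeholder):

use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
# Pretend to be a Mozilla browser on Windows; see the WWW::Mechanize
# docs for the full list of known aliases.
$mech->agent_alias('Windows Mozilla');
$mech->get('https://account.web.site');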

Some websites do block connections based on the User-Agent, but you can set that to whatever you want using Perl. It's possible that a website might also look for other request headers normally generated by a particular browser (like the Accept header) and refuse connections that don't include them, but you can add those headers too, if you figure out what it's looking for.

In general, it's impossible for a website to prevent a different client from impersonating a supported browser. No matter what it's looking for, you can eventually duplicate it.

It's also possible that it's looking for JavaScript support. In that case, you might look at WWW::Scripter, which is a subclass of WWW::Mechanize that adds JavaScript support. It's fairly new and I haven't tried it yet.

This thread is almost certainly not about merely changing the User-Agent.

I see two paths. Either we can experiment with turning off JavaScript and CSS in the browser, and learn more about getting into HTTP::Request and HTTP::Response objects while relying on LWP::UserAgent; or we can go to WWW::Scripter and use JavaScript.

Just in crude Craigslist text ads, there are three pages of densely packed, almost space-free JavaScript and CSS, and then they load more, specialized code: if I come in via Comcast, I find that special JavaScript targeting Comcast users has been loaded into the final page. The way they do that, in their attempt to break robots, is by putting code in the HEAD which lawyers the difference between HTML 1.0 and 1.1 to say, oh, there is something a little bit wrong, you need an HTTP refresh, and then porking you with extra code to snoop out your ISP and who knows what else, cookie info for sure (you can print out the cookies at every turn once you learn how to slow LWP down and insert callback code to snoop like *shark, but inside Perl; also watch how the server keeps trying to change "your" headers and "your" request, re-negotiating "your" request: oh, you don't want to buy a cheap car, you want to buy a Maserati and mortgage your house to do it, i.e., snoop your ISP, and why not your contacts and all your Google history? Who knows?).

CL puts a random ID number into Alice's HEAD, then whispers that you need an HTTP request to swallow the red pill, and to stop hiding it under your tongue. That way most robots choke and accept a fake sanitized page, i.e., a truncated "home page". Also, if I scrape URLs from the page, I can't "click" on them using LWP, because I never learned my ID, nor did I learn enough JavaScript to parrot the ID back before a $ua->get( "$url&ID=9dd887f8f89d9" ); or maybe the simple get would work with the &ID appended. It's way more than User-Agent, but you can do it, and you're getting all the help you need.

As you can see, the first path is to turn all that off and see if you can learn your re-negotiated request's URI: not the original URL, but the URI. Then get it, with no JavaScript and no WWW::Scripter. It sounds like LWP will work for you. I would like to hear more about changing the Accept headers in default_headers initially, and whether the server then says, oh, you mean ACCEPT this and this and this, swallowing the red pill in the re-negotiated Request object. You can snoop on that by inserting callbacks into the request and response conversation.
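For reference, a small sketch of such request/response snooping using LWP's handler hooks (this follows the handler example in the LWP::UserAgent documentation; the URL is the question's placeholder):

use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Dump every outgoing request and every finished response, so you can
# watch the whole conversation (redirects, re-negotiated headers,
# cookies) without leaving Perl.
$ua->add_handler(request_send  => sub { shift->dump; return });
$ua->add_handler(response_done => sub { shift->dump; return });

my $res = $ua->get('https://account.web.site');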

The second path, WWW::Scripter, is only for when we decide to swallow the red pill and go down Alice's rabbit hole, aka the Matrix. Perl philosophy dictates exhausting the other possibilities before working harder. Otherwise we wouldn't have learned our 101-level HTTP prereqs, so escalating to a bigger hammer would be just that, like dropping acid instead of taking aspirin, or not?

I just noticed something. This line:

my $res = $ua->request(GET $url);

It doesn't work on my machine at all. But I got it to work by changing it to:

my $res = $ua->get($url);

I tried a number of different values for

$ua->agent("");

but nothing seems to work.

Well, would you like to tell us what those things you tried were?

What I normally do is type

javascript:prompt('your agent string is',navigator.userAgent)

into my regular browser's URL bar, hit enter, and cut and paste what it tells me. Surely using wireshark and monitoring actual packets is overkill? The website you're trying to get to has no way of knowing you're using Perl. Just tell it whatever it expects to hear.

Tools: Firefox with TamperData and LiveHTTPHeaders, Devel::REPL, LWP.

Analysis: In the browser, turn off JavaScript and Java, delete any cookies from the target web site, start TamperData logging, and log in to the web site. Stop TamperData logging and look back through the many requests you likely placed during the login process. Find the first request (the one you made on purpose) and look at its details.

Experimentation: Start re.pl, and start recreating the browser's interaction.

use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Headers;

my $ua = LWP::UserAgent->new(
  agent      => $the_UA_of_the_browser,
  cookie_jar => HTTP::Cookies->new(hide_cookie2 => 1),
);
$ua->default_headers(HTTP::Headers->new(
  %the_headers_sent_by_the_browser,
));

my $r = $ua->get($the_URL);
$r->content($r->decoded_content);
print $r->as_string;

So that's step one. If you get mismatched responses at any point, you did something wrong. You can usually[1] find out what by looking at $r->request and comparing it with the request Firefox sent. The important thing is to remember that there is no magic and that you know everything the server knows. If you can't get the same response to what appears to be the same request, you missed something.

Getting to the first page is usually not enough. You'll likely need to parse forms (with HTML::Form; a sketch follows), follow redirects (as configured above, the UA does that automatically, but sometimes it pays to turn that off and do it by hand), and try to reverse-engineer a weirdly-hacked-together login sequence from the barest of hints. Good luck.
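Here is a sketch of that form-parsing step, continuing from the $ua and $r above (the field names username and password are hypothetical; take the real ones from the fetched page):

use HTML::Form;

# Parse every <form> in the page we just fetched; $r->base resolves
# any relative form-action URLs.
my @forms = HTML::Form->parse($r->decoded_content, $r->base);
my $login = $forms[0];             # assume the first form is the login form
$login->value(username => 'me');   # hypothetical field names
$login->value(password => 'secret');

# $login->click returns the HTTP::Request a submit click would produce.
my $r2 = $ua->request($login->click);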

[1]: Except in the case of certain bugs in LWP's cookies implementation that I won't detail here. And even then you can spot it if you know what you're looking for.

Adding the referrer portion made it work for me:

use HTTP::Request;

my $req = HTTP::Request->new(GET => $url);
# A Referer header (LWP spells the field "referer") can satisfy referrer checks.
$req->header(Accept => "text/html, */*;q=0.1", referer => 'http://google.com');
my $res = $ua->request($req);
print $res->status_line;

Is your Perl script running on the same machine as the Firefox browser you reference? It could be filtering based on subnet or incoming IP address. Your URL is https, so there could also be some PSK (pre-shared key) or certificate loaded in your browser that the server is expecting. That's extremely unlikely outside of an internal company intranet site.
