简体   繁体   中英

Logging in using WWW::Mechanize

I'm looking at logging in to https://imputationserver.sph.umich.edu/index.html#!pages/login with the following:

#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';
use feature 'say';
use autodie ':all';
use WWW::Mechanize;
use DDP;

my $mech = WWW::Mechanize->new();
$mech->get( 'https://imputationserver.sph.umich.edu/index.html#!pages/login' );
my $username = '';
my $password = '';
#$mech->set_visible( $username, $password );
#$mech -> field('Username:', $username);
#$mech -> field('Password:', $password);

my %data;
@{ $data{links} } = $mech -> find_all_links();
@{ $data{inputs}    } = $mech -> find_all_inputs();
@{ $data{submits} } = $mech ->find_all_submits();
@{ $data{forms} } = $mech -> forms();
p %data;

#$mech->set_fields('Username' => $username, 'Password' => $password);

but there doesn't appear to be any useful information, which is shown by printing:

{
    forms     [],
    inputs    [],
    links     [
        [0] WWW::Mechanize::Link  {
            public methods (9) : attrs, base, name, new, tag, text, URI, url, url_abs
            private methods (0)
            internals: [
                [0] "favicon.ico",
                [1] undef,
                [2] undef,
                [3] "link",
                [4] URI::https,
                [5] {
                    href   "favicon.ico",
                    rel    "icon"
                }
            ]
        },
        [1] WWW::Mechanize::Link  {
            public methods (9) : attrs, base, name, new, tag, text, URI, url, url_abs
            private methods (0)
            internals: [
                [0] "assets/css/loader.css",
                [1] undef,
                [2] undef,
                [3] "link",
                [4] var{links}[0][4],
                [5] {
                    href   "assets/css/loader.css",
                    rel    "stylesheet"
                }
            ]
        }
    ],
    submits   []
}

I looked on Firefox's Tools -> page info, but got nothing valuable there, I don't see where the username and password are coming from on this page.

I've tried

$mech -> submit_form(
    form_number => 0,
    fields      => { username => $username, password => $password },
);

but then I get No form defined

In terms of links, inputs, fields, I don't see any, and I don't know how to move on.

I don't see anything on https://metacpan.org/pod/WWW::Mechanize::Examples that helps me out in this situation.

How can I log in to this page using Perl's WWW::Mechanize?

As Dave says, many modern websites are going to be handling login via a Javascript-driven (private) API. You'll need to open the Network tab in your browser, do the login manually as you normally would, and watch the sequence of GETs, PUTs, POSTs, etc. that happen to see what interaction is needed to complete a login, and then execute that sequence yourself with Mech or LWP .

It's possible that the Javascript on the page is going to create JSON or even JWTs to do the interactions; you'll have to duplicate that in your code for it to work.

In particular, check the headers for cookies, and authentication and CSRF tokens being set; you'll need to capture those and re-send them with requests (POST requests will need the CSRF tokens). This may entail doing more interactions with the site to capture the sequence of operations and duplicate them. HTTP::Cookies should handle the cookies for you automatically, but more sophisticated header usage will require you to use HTTP::Headers to extract the data and possibly resubmit it that way.

At heart, the processes are all pretty simple; it's just a matter of accurately replicating them so that you can automate them.

You should check as to whether the site already has a programmer's API, and use that if so; such an API will almost always provide you simpler, direct interfaces to site functions and easier-to-use returned data formats. If the site is highly dynamic, like a heavy React site, it's possible that other pages in the site are going to load a skeletal HTML page and then use Javascript to fill it out as well; as the page evolves, your code will have to as well. If you're using a defined programmer's API, you will probably be able to depend on the interactions and returned data remaining the same as long as the API version doesn't change.

A final note: you should verify that you're not violating your user agreement by using automation. Some sites explicitly bar using automated methods of logging in.

The interesting part of the source from that page is this:

<body class="bg-light">

  <div id="main">
    <div class="spinner">
        <div class="bounce1"></div>
      <div class="bounce2"></div>
      <div class="bounce3"></div>
    </div>
  </div>

  <script src="./dist/bundles/cloudgene/index.js"></script>


</body>

So, there's no login form in the HTML that makes up that page. Which explains why WWW::Mechanize can't see anything - there's nothing there to see.

It seems that that the page is all built by that Javascript file - index.js .

Now, you could spend hours reading that JS and working exactly how the page works. But that'll be hard work and there's an easier way.

No matter how the client (the browser or your code) works, the actual login must be handled by an HTTP request and response. The client sends a request, the server responds and the client acts on that response. You just need to work out what the request and response look like and then reproduce that in your code.

And you can examine the HTTP requests and response using tools that are almost certainly built into your browser (in Chrome, it's dot menu -> more tools -> developer tools). That will allow you to see exactly what the HTTP request looks like.

Having done that, you "just" need to craft a similar response using your Perl code. You'll probably find that's easier using LWP::UserAgent and its associated modules rather than WWW::Mechanize.

WWW::Mechanize is a web client with some HTML parsing capabilities. But as Dave Cross pointed out, the form you want is not in the HTML document you requested. It's generated by some JavaScript code. To do what the browser does would require a JavaScript engine, which WWW::Mechanize doesn't have.

The simplest way to achieve that is to remote-control a web browser (eg using Selenium::Chrome ).

The other is to manually craft the login request without getting and filling the form.

Looking at your code, I see the following URL:

https://imputationserver.sph.umich.edu/index.html#!pages/login

It is this part in particular that drew my attention: #!pages/login

This likely means that the login form is not present on the page when it is loaded, and is instead added to the page with JavaScript after page load.

Your script doesn't know this, however, and looks for the login form and its elements right away after page load.

The easiest way to solve this issue is to place a hard-coded timeout of, let's say, 5 seconds between page load and trying to log in.

The more "correct" way of handling this is to wait for the login form to appear by checking for its controls, and then proceed with the login process.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM