Accessing a web site via web scrape

When attempting to web scrape Rubies, I am unable to get past the login. I have no idea why, but here are the cURL options that I am using. If anyone sees a problem, I would greatly appreciate it!

curl_setopt_array($curl, array(
    CURLOPT_URL => "https://www.rubies.com/customer/account/loginPost/",
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => "",
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 30,
    CURLOPT_HEADER => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_POST => 1,
    CURLOPT_POSTFIELDS => array('form_key' => "****", "login[username]" => "****", "login[password]" => "****", "persistent_remember_me" => 'on', "send" => ''),
    CURLOPT_FOLLOWLOCATION => 1,
    CURLOPT_COOKIEFILE => 'cookie.txt',
    CURLOPT_COOKIEJAR => 'cookie.txt',
    CURLOPT_HTTPHEADER => array(
        'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Host: www.rubies.com',
        'Content-Type: application/x-www-form-urlencoded',
        'Origin: https://www.rubies.com',
        'Referer: https://www.rubies.com/customer/account/',
        'Connection: keep-alive',
        'Cache-Control: no-cache',
        'Upgrade-Insecure-Requests: 1'
    ),
    CURLOPT_SSL_VERIFYPEER => false,
    CURLOPT_SSL_VERIFYHOST => false,
    CURLINFO_HEADER_OUT => true
));

I currently have the form key hard-coded, but I am not sure whether the form key has to change with each login. The response from the POST is empty, but I get redirected twice: once to the account page, then back to the login page. If anyone can tell me what is going on, I would appreciate it. I think they are using some kind of basic auth system.

Use Fiddler2 or another packet sniffer to look at the cURL traffic, both requests and responses. Compare that to the traffic a browser produces for the same login.
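
If no proxy is at hand, cURL can also log its own side of the conversation. The following is a minimal sketch, assuming the PHP curl extension is loaded; the function name fetch_with_trace is made up for illustration. As far as I know, the verbose log stays empty when CURLINFO_HEADER_OUT is enabled on the same handle, so leave that option out while tracing.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: fetch a URL and return the body together with
 * libcurl's verbose log (request headers sent, response headers received).
 */
function fetch_with_trace(string $url, array $extraOptions = []): array
{
    $log = fopen('php://temp', 'w+');    // in-memory buffer for the verbose log
    $curl = curl_init($url);
    // Caller-supplied options win over the defaults below (array union).
    curl_setopt_array($curl, $extraOptions + [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_VERBOSE        => true,  // log the conversation ...
        CURLOPT_STDERR         => $log,  // ... into the buffer instead of stderr
    ]);
    $body  = curl_exec($curl);           // false on failure
    $error = curl_error($curl);

    rewind($log);
    $trace = stream_get_contents($log);
    fclose($log);

    return ['body' => $body, 'error' => $error, 'trace' => $trace];
}
```

Printing the 'trace' entry next to the browser's request in the developer tools makes missing cookies or differing headers easy to spot.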

You probably either missed or mistyped a field, or missed follow-up steps like setting cookies and posting additional data.

Code for a login often requires fetching the login page, scraping a one-time token (changes with each page request), then posting as the first step. This might trigger script code to set cookies and/or automatically submit other data.
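
The token-scraping step can be sketched as below. This is only an illustration under assumptions: the helper name extract_form_key and the sample markup are made up, and the field name form_key is taken from the question. In a real run the HTML would come from a GET of the login page on the same cURL handle (same cookies) that later sends the POST.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: return the value of the first
 * <input name="form_key"> in the HTML, or null if there is none.
 */
function extract_form_key(string $html): ?string
{
    if ($html === '') {
        return null; // DOMDocument::loadHTML() rejects empty input
    }
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath  = new DOMXPath($dom);
    $inputs = $xpath->query('//input[@name="form_key"]');
    if ($inputs === false || $inputs->length === 0) {
        return null;
    }
    $node = $inputs->item(0);
    return $node instanceof DOMElement ? $node->getAttribute('value') : null;
}

// Made-up sample markup standing in for the fetched login page.
$sample = '<form class="form form-login" method="post">'
        . '<input name="form_key" type="hidden" value="abc123">'
        . '<input name="login[username]" type="email">'
        . '</form>';
var_dump(extract_form_key($sample)); // string(6) "abc123"
```

Because the token changes with each page request, it has to be extracted fresh on every run rather than hard-coded.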

You are making several mistakes.

You tell the server that your POST body is application/x-www-form-urlencoded, but you give CURLOPT_POSTFIELDS an array, so what you actually send is multipart/form-data. To have curl send the data as application/x-www-form-urlencoded, urlencode the data for CURLOPT_POSTFIELDS; for arrays specifically, http_build_query will do this for you.

Furthermore, when POSTing multipart/form-data or application/x-www-form-urlencoded, don't set the Content-Type header at all; curl sets it automatically, depending on which encoding was used. On that note, you shouldn't set the User-Agent header manually either; use CURLOPT_USERAGENT. You should not set the Host header either: curl generates it automatically, and you're more likely than curl to make a mistake.

You also send a fake Referer header. Some websites can detect when the referer is fake; it's safer to set CURLOPT_AUTOREFERER and make a real request, thus obtaining a real referer.

Finally, to actually log in to https://www.rubies.com/customer/account/loginPost/ , you need both a cookie session and a form_key. The form_key is probably tied to your cookie session and is probably a form of CSRF token, but you provide no code to acquire either. On top of that, you may need a real referer.
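
To make the encoding point concrete, here is a small sketch using plain PHP. The helper names build_login_body and build_login_options are hypothetical; the field names come from the question and the values are placeholders. Passing a string to CURLOPT_POSTFIELDS makes curl send application/x-www-form-urlencoded, so no Content-Type, Host, User-Agent or Referer header is written by hand.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: build the login body as a urlencoded string.
 * Nested arrays become bracketed keys such as login[username].
 */
function build_login_body(string $formKey, string $username, string $password): string
{
    return http_build_query([
        'form_key' => $formKey,
        'login'    => ['username' => $username, 'password' => $password],
        'persistent_remember_me' => 'on',
        'send'     => '',
    ]);
}

/**
 * Hypothetical helper: cURL options for the POST, without any
 * hand-written request headers.
 */
function build_login_options(string $body, string $userAgent): array
{
    return [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $body,      // string body: sent urlencoded
        CURLOPT_USERAGENT      => $userAgent, // instead of a manual User-Agent header
        CURLOPT_AUTOREFERER    => true,       // curl fills in Referer when following redirects
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEFILE     => '',         // empty string enables the in-memory cookie engine
        CURLOPT_RETURNTRANSFER => true,
    ];
}

echo build_login_body('abc123', 'user@example.com', 'p&ss'), "\n";
// form_key=abc123&login%5Busername%5D=user%40example.com&login%5Bpassword%5D=p%26ss&persistent_remember_me=on&send=
```

The options array is meant to be applied to the same handle that fetched the login page, so that the session cookie and the form_key belong together.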

Using hhb_curl from https://github.com/divinity76/hhb_.inc.php/blob/master/hhb_.inc.php , here is example code that I think would be able to log in with a real username/password, making none of the mistakes listed above:

<?php
declare(strict_types = 1);
require_once ('hhb_.inc.php');
$hc = new hhb_curl ();

$hc->_setComfortableOptions ();
$hc->exec ( 'https://www.rubies.com/customer/account/login/' ); // << getting a referer, form_key (csrf token?), and a session.
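$domd = new DOMDocument (); // loadHTML is not static; calling it statically fails on PHP 8
@$domd->loadHTML ( $hc->getResponseBody () ); // @ hides warnings about malformed HTML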
$csrf = NULL;

// extract the form_key
foreach ( $domd->getElementsByTagName ( "form" ) as $form ) {
    if ($form->getAttribute ( "class" ) !== 'form form-login') {
        continue;
    }
    foreach ( $form->getElementsByTagName ( "input" ) as $input ) {
        if ($input->getAttribute ( "name" ) !== 'form_key') {
            continue;
        }
        $csrf = $input->getAttribute ( "value" );
        break;
    }
    break;
}
if ($csrf === NULL) {
    throw new \RuntimeException ( 'failed to extract the form_key token!' );
}

$hc->setopt_array ( array (
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => http_build_query ( array (
                'form_key' => $csrf,
                'login' => array (
                        'username' => '???',
                        'password' => '???' 
                ),
                'persistent_remember_me' => 'on',
                'send' => ''  // ??
        ) ) 
) );

$hc->exec ( 'https://www.rubies.com/customer/account/loginPost/' ); // POST to the login target named in the question
hhb_var_dump ( $hc->getStdErr (), $hc->getResponseBody () );

EDIT: fixed a URL; the original code definitely wouldn't work, but it should now.
