
Accessing a web site via web scrape

When attempting to web scrape Rubies, I am unable to get past the login. I have no idea why, but here are the cURL options that I am using. If anyone sees a problem, I would greatly appreciate it!

curl_setopt_array($curl, array(
    CURLOPT_URL => "https://www.rubies.com/customer/account/loginPost/",
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => "",
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 30,
    CURLOPT_HEADER => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_POST => 1,
    CURLOPT_POSTFIELDS => array('form_key' => "****", "login[username]" => "****", "login[password]" => "****", "persistent_remember_me" => 'on', "send" => ''),
    CURLOPT_FOLLOWLOCATION => 1,
    CURLOPT_COOKIEFILE => 'cookie.txt',
    CURLOPT_COOKIEJAR => 'cookie.txt',
    CURLOPT_HTTPHEADER => array(
        'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Host: www.rubies.com',
        'Content-Type: application/x-www-form-urlencoded',
        'Origin: https://www.rubies.com',
        'Referer: https://www.rubies.com/customer/account/',
        'Connection: keep-alive',
        'Cache-Control: no-cache',
        'Upgrade-Insecure-Requests: 1'
    ),
    CURLOPT_SSL_VERIFYPEER => false,
    CURLOPT_SSL_VERIFYHOST => false,
    CURLINFO_HEADER_OUT => true
));

I currently have the form key hard-coded, but I am not sure whether I need to change it for each login. The response from the POST is empty, but I get redirected twice: once to the account page, then back to the login page. If anyone can tell me what is going on, I would appreciate it. I think they are using some kind of basic auth system.

Use fiddler2 or another packet sniffer to look at the cURL traffic, both requests and responses. Compare that to the traffic a browser produces for the same login.
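If a separate sniffer is not at hand, cURL's own verbose log shows much of the same header traffic. The following is a minimal sketch rather than a definitive tool: the function names `fetch_with_verbose_log` and `split_verbose_log` are made up for illustration, and `127.0.0.1:8888` is assumed to be the address Fiddler listens on by default. The verbose log contains headers only, not the POST body.

```php
<?php
/**
 * Perform a GET request with verbose logging and return the raw log text.
 * $proxy is optional, e.g. '127.0.0.1:8888' to route through a local Fiddler.
 */
function fetch_with_verbose_log(string $url, ?string $proxy = null): string
{
    $log = fopen('php://temp', 'w+');
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_VERBOSE => true,   // write protocol details ...
        CURLOPT_STDERR => $log,    // ... into our temp stream
    ]);
    if ($proxy !== null) {
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
    }
    curl_exec($ch);
    unset($ch);

    rewind($log);
    $text = stream_get_contents($log);
    fclose($log);
    return $text === false ? '' : $text;
}

/**
 * Split a verbose log into header lines sent ("> ") and received ("< ").
 * Informational lines ("* ") and empty header terminators are dropped.
 */
function split_verbose_log(string $log): array
{
    $out = ['sent' => [], 'received' => []];
    foreach (preg_split('/\R/', $log) ?: [] as $line) {
        if (strncmp($line, '> ', 2) === 0) {
            $key = 'sent';
        } elseif (strncmp($line, '< ', 2) === 0) {
            $key = 'received';
        } else {
            continue;
        }
        $header = rtrim(substr($line, 2));
        if ($header !== '') {
            $out[$key][] = $header;
        }
    }
    return $out;
}
```

Comparing the `sent` and `received` lists with what the browser's network tab shows for the same login should reveal a missing cookie or header. Routing HTTPS through Fiddler also requires trusting its root certificate.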

You probably either missed or mistyped a field, or missed follow-up steps such as setting cookies and posting additional data.

Login code often has to fetch the login page, scrape a one-time token (which changes with each page request), and then post it along with the credentials as the first step. That may in turn trigger script code that sets cookies and/or automatically submits other data.
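The token-scraping step can be sketched with PHP's built-in DOM extension. This is only an illustration under assumptions: the helper name `extract_hidden_input` is made up, and the field name `form_key` is taken from the question; the real page's markup may differ.

```php
<?php
/**
 * Return the value of the first <input> whose name attribute equals $name,
 * or null if there is no such input (or the HTML is empty).
 */
function extract_hidden_input(string $html, string $name): ?string
{
    if ($html === '') {
        return null; // DOMDocument::loadHTML() rejects an empty string
    }

    $dom = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('input') as $input) {
        if ($input->getAttribute('name') === $name) {
            return $input->getAttribute('value');
        }
    }
    return null;
}
```

The token has to be read from the response to a fresh GET of the login page, made with the same cookie jar that the subsequent POST uses; a hard-coded value from an earlier browser session will most likely be rejected.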

You are making several mistakes.

You tell the server that your POST body is application/x-www-form-urlencoded, but you give CURLOPT_POSTFIELDS an array, so what you actually send is multipart/form-data. To have curl send the data as application/x-www-form-urlencoded, URL-encode the data for CURLOPT_POSTFIELDS; for arrays specifically, http_build_query will do this for you.

Furthermore, when POSTing multipart/form-data or application/x-www-form-urlencoded, don't set the Content-Type header at all; curl sets it automatically, depending on which encoding is used. On that note, you shouldn't set the User-Agent header manually either; use CURLOPT_USERAGENT instead. You should not set the Host header either: curl generates it automatically, and you are more likely than curl to make a mistake.

Also, you send a fake Referer header. Some websites can detect a fake referer, so it is safer to set CURLOPT_AUTOREFERER and make a real request, thus obtaining a real referer.

Finally, to actually log in to https://www.rubies.com/customer/account/loginPost/ , you need both a cookie session and a form_key. The form_key is probably tied to your cookie session and is probably a form of CSRF token, but you provide no code to acquire either. On top of that, you may need a real referer.
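The encoding and header points can be sketched with plain cURL options. This is an illustration under assumptions, not a tested login: the function names `build_login_body` and `login_post_options` are made up, the field names are copied from the question, and the user-agent string is a placeholder.

```php
<?php
/**
 * Build an application/x-www-form-urlencoded body.
 * The nested 'login' array becomes login[username] / login[password].
 */
function build_login_body(string $formKey, string $username, string $password): string
{
    return http_build_query([
        'form_key' => $formKey,
        'login' => ['username' => $username, 'password' => $password],
        'persistent_remember_me' => 'on',
        'send' => '',
    ]);
}

/**
 * Options for the POST. No Content-Type, Host, User-Agent or Referer
 * entries in CURLOPT_HTTPHEADER: curl derives them itself.
 */
function login_post_options(string $body): array
{
    return [
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => $body,    // a string, so curl sends urlencoded
        CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; example-scraper)', // placeholder
        CURLOPT_AUTOREFERER => true,    // set Referer when following redirects
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEFILE => '',       // empty string enables the in-memory cookie engine
        CURLOPT_RETURNTRANSFER => true,
    ];
}
```

For example, `build_login_body('tok123', 'user@example.com', 'p w')` yields `form_key=tok123&login%5Busername%5D=user%40example.com&login%5Bpassword%5D=p+w&persistent_remember_me=on&send=`. The same curl handle, and therefore the same cookies, should be used for both the initial GET of the login page and this POST.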

Using hhb_curl from https://github.com/divinity76/hhb_.inc.php/blob/master/hhb_.inc.php , here is example code that I think would be able to log in, given a real username and password, while making none of the mistakes listed above:

<?php
declare(strict_types = 1);
require_once ('hhb_.inc.php');
$hc = new hhb_curl ();

$hc->_setComfortableOptions ();
$hc->exec ( 'https://www.rubies.com/customer/account/login/' ); // << getting a referer, form_key (csrf token?), and a session.
$domd = new DOMDocument ();
@$domd->loadHTML ( $hc->getResponseBody () ); // loadHTML is an instance method; @ silences warnings about imperfect markup
$csrf = NULL;

// extract the form_key
foreach ( $domd->getElementsByTagName ( "form" ) as $form ) {
    if ($form->getAttribute ( "class" ) !== 'form form-login') {
        continue;
    }
    foreach ( $form->getElementsByTagName ( "input" ) as $input ) {
        if ($input->getAttribute ( "name" ) !== 'form_key') {
            continue;
        }
        $csrf = $input->getAttribute ( "value" );
        break;
    }
    break;
}
if ($csrf === NULL) {
    throw new \RuntimeException ( 'failed to extract the form_key token!' );
}

$hc->setopt_array ( array (
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => http_build_query ( array (
                'form_key' => $csrf,
                'login' => array (
                        'username' => '???',
                        'password' => '???' 
                ),
                'persistent_remember_me' => 'on',
                'send' => ''  // ??
        ) ) 
) );

$hc->exec ( 'https://www.rubies.com/customer/account/loginPost/' ); // POST to the loginPost URL named above
hhb_var_dump ( $hc->getStdErr (), $hc->getResponseBody () );

EDIT: fixed a URL; the original code definitely wouldn't work, but it should now.
