Accessing a website through web scraping

Question

When attempting to web scrape Rubies, I am unable to get past the login. I have no idea why, but here are the cURL options that I am using. If anyone sees a problem, I would greatly appreciate it!

curl_setopt_array($curl, array(
    CURLOPT_URL => "https://www.rubies.com/customer/account/loginPost/",
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => "",
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 30,
    CURLOPT_HEADER => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_POST => 1,
    CURLOPT_POSTFIELDS => array('form_key' => "****", "login[username]" => "****", "login[password]" => "****", "persistent_remember_me" => 'on', "send" => ''),
    CURLOPT_FOLLOWLOCATION => 1,
    CURLOPT_COOKIEFILE => 'cookie.txt',
    CURLOPT_COOKIEJAR => 'cookie.txt',
    CURLOPT_HTTPHEADER => array(
        'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Host: www.rubies.com',
        'Content-Type: application/x-www-form-urlencoded',
        'Origin: https://www.rubies.com',
        'Referer: https://www.rubies.com/customer/account/',
        'Connection: keep-alive',
        'Cache-Control: no-cache',
        'Upgrade-Insecure-Requests: 1'
    ),
    CURLOPT_SSL_VERIFYPEER => false,
    CURLOPT_SSL_VERIFYHOST => false,
    CURLINFO_HEADER_OUT => true
));

I currently have the form key hard-coded, but I am not sure whether the form key has to change with each login. The response from the POST is empty, but I get redirected twice: once to the account page, then back to the login page. If anyone can tell me what is going on, I would appreciate it. I think they are using some kind of basic auth system.

Answer 1

Use Fiddler2 or another packet sniffer to look at the cURL traffic, both requests and responses. Compare that to the traffic generated by a browser.
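If a sniffer is not at hand, curl's own verbose trace shows the request and response headers it actually exchanged, which is often enough for a first comparison with the browser. The following is only a minimal sketch: the function name `fetch_with_trace` is made up for illustration, and the trace covers headers and connection details, not the POST body.

```php
<?php
declare(strict_types=1);

/**
 * Perform a GET request and capture curl's verbose trace.
 * Returns ['body' => string|false, 'trace' => string].
 * (fetch_with_trace is a hypothetical helper name, not part of any library.)
 */
function fetch_with_trace(string $url): array
{
    $log = fopen('php://temp', 'w+'); // in-memory buffer for the trace
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_VERBOSE        => true, // log headers and connection info
        CURLOPT_STDERR         => $log, // write that log to our buffer, not stderr
    ]);
    $body = curl_exec($ch);
    unset($ch); // release the handle before the log stream is closed

    rewind($log);
    $trace = stream_get_contents($log);
    fclose($log);

    return ['body' => $body, 'trace' => $trace];
}
```

Dumping `$result['trace']` next to the headers shown in the browser's network tab makes missing cookies or unexpected Content-Type values easy to spot.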

You probably either missed or mistyped a field, or missed follow-up steps like setting cookies and posting additional data.

Code for a login often requires fetching the login page, scraping a one-time token (it changes with each page request), and then posting that token along with the credentials as the first step. This might trigger script code to set cookies and/or automatically submit other data.
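The fetch, scrape, post sequence described above can be sketched with plain ext/curl roughly as follows. The field names `form_key`, `login[username]` and `login[password]` are taken from the question; the function names `extract_form_key` and `login` are made up for illustration, and whether this particular site needs further steps is not something I can confirm.

```php
<?php
declare(strict_types=1);

/** Return the value of the first <input name="form_key">, or null if absent. */
function extract_form_key(string $html): ?string
{
    if ($html === '') {
        return null;
    }
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($dom))->query('//input[@name="form_key"]/@value');
    return ($nodes !== false && $nodes->length > 0) ? $nodes->item(0)->nodeValue : null;
}

/** Fetch the login page, scrape the token, then POST on the same cookie session. */
function login(string $loginPageUrl, string $postUrl, string $username, string $password): string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEFILE     => '', // empty string: keep cookies in memory for this handle
    ]);

    // Step 1: GET the login page to obtain a session cookie and a fresh token.
    curl_setopt($ch, CURLOPT_URL, $loginPageUrl);
    $page = curl_exec($ch);
    if ($page === false) {
        throw new RuntimeException('GET failed: ' . curl_error($ch));
    }
    $token = extract_form_key($page);
    if ($token === null) {
        throw new RuntimeException('form_key not found on the login page');
    }

    // Step 2: POST the credentials with the token, reusing the same handle.
    curl_setopt_array($ch, [
        CURLOPT_URL        => $postUrl,
        CURLOPT_POST       => true,
        CURLOPT_REFERER    => $loginPageUrl, // the page we really just visited
        CURLOPT_POSTFIELDS => http_build_query([
            'form_key' => $token,
            'login'    => ['username' => $username, 'password' => $password],
        ], '', '&'),
    ]);
    $response = curl_exec($ch);
    if ($response === false) {
        throw new RuntimeException('POST failed: ' . curl_error($ch));
    }
    return $response;
}
```

Reusing one handle for both requests is the important part: the cookie set by the first response is sent back with the POST, so the token and the session belong together.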

Answer 2

You are making several mistakes.

You tell the server that your POST body is application/x-www-form-urlencoded, but you give CURLOPT_POSTFIELDS an array, so what you actually send to the server is multipart/form-data encoded. To have curl send the POST data as application/x-www-form-urlencoded, URL-encode the data you pass to CURLOPT_POSTFIELDS; for arrays specifically, http_build_query will do this for you.

Furthermore, when POSTing multipart/form-data or application/x-www-form-urlencoded, don't set the Content-Type header at all; curl sets it for you automatically, depending on which encoding is used. On that note, you shouldn't set the User-Agent header manually either; use CURLOPT_USERAGENT instead. You should not set the Host header either: curl generates it automatically, and you are more likely than curl to make a mistake.

Also, you are sending a fake Referer header here. Some websites can detect when the referer is fake, so it is safer to set CURLOPT_AUTOREFERER and make a real request, thus obtaining a real referer.

Finally, to actually log in to https://www.rubies.com/customer/account/loginPost/ , you need both a cookie session and a form_key code. The form_key is probably tied to your cookie session and is probably a form of CSRF token, but you provide no code to acquire either. On top of that, you may need a real referer.
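To make the encoding point concrete, here is a small sketch of what http_build_query produces for a nested array shaped like the login fields in the question. The values are placeholders for illustration only.

```php
<?php
declare(strict_types=1);

// Placeholder values; a real form_key would be scraped from the login page.
$fields = [
    'form_key' => 'abc123',
    'login'    => [
        'username' => 'user@example.com',
        'password' => 'secret',
    ],
    'send'     => '',
];

// A string body makes curl send application/x-www-form-urlencoded;
// passing the array itself to CURLOPT_POSTFIELDS would send multipart/form-data.
$body = http_build_query($fields, '', '&');

echo $body, "\n";
// prints: form_key=abc123&login%5Busername%5D=user%40example.com&login%5Bpassword%5D=secret&send=
```

The `%5B` and `%5D` sequences are the encoded square brackets, so the server still sees the fields as `login[username]` and `login[password]`.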

Using hhb_curl from https://github.com/divinity76/hhb_.inc.php/blob/master/hhb_.inc.php , here is example code that I think would be able to log in, given a real username/password, while making none of the mistakes I listed above:

<?php
declare(strict_types = 1);
require_once ('hhb_.inc.php');
$hc = new hhb_curl ();

$hc->_setComfortableOptions ();
$hc->exec ( 'https://www.rubies.com/customer/account/login/' ); // << getting a referer, form_key (csrf token?), and a session.
$domd = new DOMDocument ();
@$domd->loadHTML ( $hc->getResponseBody () ); // loadHTML is an instance method; calling it statically fails on PHP 8
$csrf = NULL;

// extract the form_key
foreach ( $domd->getElementsByTagName ( "form" ) as $form ) {
    if ($form->getAttribute ( "class" ) !== 'form form-login') {
        continue;
    }
    foreach ( $form->getElementsByTagName ( "input" ) as $input ) {
        if ($input->getAttribute ( "name" ) !== 'form_key') {
            continue;
        }
        $csrf = $input->getAttribute ( "value" );
        break;
    }
    break;
}
if ($csrf === NULL) {
    throw new \RuntimeException ( 'failed to extract the form_key token!' );
}

$hc->setopt_array ( array (
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => http_build_query ( array (
                'form_key' => $csrf,
                'login' => array (
                        'username' => '???',
                        'password' => '???' 
                ),
                'persistent_remember_me' => 'on',
                'send' => ''  // ??
        ) ) 
) );

$hc->exec ( 'https://www.rubies.com/customer/account/loginPost/' ); // POST to the login endpoint named above, not back to the login page
hhb_var_dump ( $hc->getStdErr (), $hc->getResponseBody () );

EDIT: fixed a URL. The original code definitely wouldn't work, but it should now.
