Unable to fetch my schedule data from my school's site. Login with cURL won't work

Edit: Why the minus one?

What I am trying to do is the following:

  • I am trying to log in to my school's site using cURL and grab the schedule to use it for my AI.

So I need to log in using my student number and password, but the form on the school site also needs a hidden 'token'.

<form action="index.php" method="post">
    <input type="hidden" name="token" value="becb14a25acf2a0e697b50eae3f0f205" />
    <input type="text" name="user" />
    <input type="password" name="password" />
    <input type="submit" value="submit">
</form>

I'm able to successfully retrieve the token. Then I try to log in, but it fails.

// Getting the whole website
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.school.com');
$data = curl_exec($ch);

// Retrieving the token and putting it in a POST
$regex = '/<regexThatWorks>/';
preg_match($regex,$data,$match);
$postfields = "user=<number>&password=<secret>&token=$match[1]";

// Should I use a fresh cURL here?

// Setting the POST options, etc.
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postfields);

// I won't use CURLOPT_RETURNTRANSFER yet, first I want to see results. 
$data = curl_exec($ch);

curl_close($ch); 

Well... It doesn't work...

  • Is it possible the token changes on every curl_exec? Because the site doesn't recognize the script the second time...
  • Should I create a new cURL instance(?) for the second part?
  • Is there another way to grab the token within 1 connection?
  • Cookies?

What's the error message you get? Independently of that, your school's website might check the referrer header to make sure that the request is coming from (an application pretending to be...) its login page.
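If the site does check the referrer, cURL can send one explicitly through CURLOPT_REFERER. A minimal sketch, assuming the login form posts back to its own URL; the helper name loginRequestOptions and the URL in the usage note are made up for illustration:

```php
<?php
// Hypothetical helper: builds the cURL options for a login POST that also
// sends a Referer header naming the login page.
function loginRequestOptions(string $loginUrl, array $fields): array
{
    return [
        CURLOPT_URL            => $loginUrl,
        CURLOPT_REFERER        => $loginUrl,                  // claim to come from the login page
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),  // URL-encodes every value
        CURLOPT_RETURNTRANSFER => true,                       // return the body instead of printing it
    ];
}
```

The array can be applied to an existing handle with curl_setopt_array($ch, loginRequestOptions('http://www.school.com/index.php', $fields)).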

This is how I solved it. The problem was probably the 'not-using-cookies' part. Still, this is probably 'ugly' code, so any improvements are welcome!

// This part is for retrieving the token from the hidden field.
// To be honest, I have no idea what the cookie lines actually do, but it works.
$getToken= curl_init();
curl_setopt($getToken, CURLOPT_URL, '<schoolsite>');       // Set the link
curl_setopt($getToken, CURLOPT_COOKIEJAR, 'cookies.txt');  // Magic
curl_setopt($getToken, CURLOPT_COOKIEFILE, 'cookies.txt'); // Magic
curl_setopt($getToken, CURLOPT_RETURNTRANSFER, 1);         // Return only as a string
$data = curl_exec($getToken);                              // Perform action

// Close the connection if there are no errors
if(curl_errno($getToken)){print curl_error($getToken);}
else{curl_close($getToken);}

// Use a regular expression to fetch the token
$regex = '/name="token" value="(.*?)"/';
preg_match($regex,$data,$match);

// Put the login info and the token in a POST body string.
// The field names must match the names of the form's inputs.
$postfield = "token=$match[1]&user=<number>&paswoord=<mine>";
echo($postfield);

// This part is for logging in and getting the data.
$site = curl_init();
curl_setopt($site, CURLOPT_URL, '<schoolsite>');
curl_setopt($site, CURLOPT_COOKIEJAR, 'cookies.txt');    // Magic
curl_setopt($site, CURLOPT_COOKIEFILE, 'cookies.txt');   // Magic
curl_setopt($site, CURLOPT_POST, 1);                     // Use POST (not GET)
curl_setopt($site, CURLOPT_POSTFIELDS, $postfield);      // Set the POST body
$forevil_uuh_no_GOOD_purposes = curl_exec($site);        // Output the results

// Close connection if no errors           
if(curl_errno($site)){print curl_error($site);}
else{curl_close($site);} 
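As for what the cookie lines do: CURLOPT_COOKIEJAR names the file cURL writes its cookies to when the handle is closed, and CURLOPT_COOKIEFILE names the file cookies are read from (setting it also switches on cURL's cookie handling). The token is presumably tied to a session cookie, which would explain why the login failed without them. A sketch of the same flow on a single handle, so no cookie file is needed; the function names are made up, and 'password' is an assumed field name taken from the form in the question:

```php
<?php
// Pulls the hidden token out of the login page HTML (same regex as above).
function extractToken(string $html): ?string
{
    return preg_match('/name="token" value="(.*?)"/', $html, $m) ? $m[1] : null;
}

// Fetches the login page, then posts the credentials on the SAME handle,
// so the session cookie from the first response is sent with the second.
function loginAndFetch(string $url, string $user, string $password): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_COOKIEFILE, '');        // empty string: keep cookies in memory only
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it

    $page = curl_exec($ch);
    if ($page === false) {
        throw new RuntimeException(curl_error($ch));
    }
    $token = extractToken($page);
    if ($token === null) {
        throw new RuntimeException('No token found in the login page');
    }

    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query([
        'token'    => $token,
        'user'     => $user,
        'password' => $password,   // assumed field name; use whatever the real form uses
    ]));
    $result = curl_exec($ch);
    if ($result === false) {
        throw new RuntimeException(curl_error($ch));
    }
    return $result;                // the handle is released when $ch goes out of scope
}
```

Using http_build_query instead of a hand-built string also takes care of URL-encoding, which matters as soon as the password contains characters such as & or =.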

As you're building a scraper, you can create your own classes that do what you need in your domain. You can start by creating your own set of request and response classes that deal with what you need to deal with.

Creating your own request class will allow you to implement the cURL request the way you need it. Creating your own response class can help you access/parse the returned HTML.

This is a simple usage example of some classes I've created for a demo:

# simple get request
$request = new MyRequest('http://hakre.wordpress.com/');
$response = new MyResponse($request);
foreach($response->xpath('//div[@id="container"]//div[contains(normalize-space(@class), " post ")]') as $node)
{
    if (!$node->h2->a) continue;
    echo $node->h2->a, "\n<", $node->h2->a['href'] ,">\n\n"; 
}

It will return my blog posts:

Will Automattic join Dec 29 move away from GoDaddy day?
<http://hakre.wordpress.com/2011/12/23/will-automattic-join-dec-29-move-away-from-godaddy-day/>

PHP UTF-8 string Length
<http://hakre.wordpress.com/2011/12/13/php-utf-8-string-length/>

Title belongs into Head
<http://hakre.wordpress.com/2011/11/02/title-belongs-into-head/>

...

Sending a GET request is then easy as pie, and the response can be easily accessed with an XPath expression (here via SimpleXML). XPath can be useful to select the token from the form field, as it allows you to query data of the document more easily than with a regular expression.

Sending a POST request was the next thing to build. I tried to write a login script for my blog and it turned out to work quite well. I needed to parse response headers as well, so I added some more routines to my request and response classes.
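For illustration, selecting the token with XPath does not require any custom classes. A minimal sketch using PHP's bundled DOMDocument and DOMXPath (rather than SimpleXML); the function name extractTokenWithXPath is made up:

```php
<?php
// Returns the value of the hidden "token" input, or null if there is none.
function extractTokenWithXPath(string $html): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // string(...) yields the attribute value of the first match, or '' if nothing matches.
    $value = $xpath->evaluate('string(//input[@name="token"]/@value)');
    return $value === '' ? null : $value;
}
```

Unlike the regular expression, this keeps working if the site reorders the attributes of the input element or changes its whitespace.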

# simple post request
$request = new MyRequest('https://example.wordpress.com/wp-login.php');
$postFields = array(
    'log' => 'username', 
    'pwd' => 'password',
);
$request->setPostFields($postFields);
$response = new MyResponse($request->returnHeaders(1)->execute());
echo (string) $response; # output to view headers
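What parsing the response headers can involve, as a rough sketch: with CURLOPT_HEADER enabled, curl_exec returns the headers and the body in one string, separated by an empty line. The function name splitResponse is made up, and the sketch assumes a single header block (no redirects followed and no "100 Continue" responses):

```php
<?php
// Splits a raw response (as returned with CURLOPT_HEADER enabled) into
// status line, headers and body. Header names are lower-cased.
function splitResponse(string $raw): array
{
    [$head, $body] = array_pad(explode("\r\n\r\n", $raw, 2), 2, '');
    $lines  = explode("\r\n", $head);
    $status = array_shift($lines);

    $headers = [];
    foreach ($lines as $line) {
        if (strpos($line, ':') === false) {
            continue;                                          // skip anything that is not "Name: value"
        }
        [$name, $value] = explode(':', $line, 2);
        $headers[strtolower(trim($name))][] = trim($value);    // a header may repeat (e.g. Set-Cookie)
    }
    return ['status' => $status, 'headers' => $headers, 'body' => $body];
}
```

After a login POST, the status line and the location and set-cookie entries are usually the interesting parts.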

Considering your scenario, you might want to edit your own request class to better deal with what you need. Mine already uses cookies, as you're using them, too. So some code based on these classes for your scenario could look like:

# input values
$url = '<schoolsite>';
$user  = '<number>';
$password = '<secret>';

# execute the first get request to obtain token
$response = new MyResponse(new MyRequest($url));
# xpath() returns a list of matching nodes; take the first one
$tokens = $response->xpath('//input[@name="token"]/@value');
$token = (string) $tokens[0];

# execute the second login post request
$request = new MyRequest($url);
$postFields = array(
    'user' => $user,
    'password' => $password,
    'token' => $token
);
$request->setPostFields($postFields)->execute();

Demo and code as gist.

If you want to improve this further, the next step is to create a class for the "school service" that you use to fetch the schedule:

class MySchoolService
{
    private $url, $user, $pass;
    private $isLoggedIn;
    public function __construct($url, $user, $pass)
    {
        $this->url = $url;
        ...
    }
    public function getSchedule()
    {
        $this->ensureLogin();

        # your code to obtain the schedule, e.g. in form of an array.
        $schedule = ...

        return $schedule;
    }
    private function ensureLogin($reuse = TRUE)
    {
        if ($reuse && $this->isLoggedIn) return;

        # execute the first get request to obtain token
        $response = new MyResponse(new MyRequest($this->url));
        # xpath() returns a list of matching nodes; take the first one
        $tokens = $response->xpath('//input[@name="token"]/@value');
        $token = (string) $tokens[0];

        # execute the second login post request
        $request = new MyRequest($this->url);
        $postFields = array(
            'user' => $this->user,
            'password' => $this->pass,
            'token' => $token
        );
        $request->setPostFields($postFields)->execute();

        $this->isLoggedIn = TRUE;
    }
}

After you've nicely wrapped the request/response logic into your MySchoolService class, you only need to instantiate it with the proper configuration and you can easily use it inside your website:

$school = new MySchoolService('<schoolsite>', '<number>', '<secret>');
$schedule = $school->getSchedule();

Your main script only uses the MySchoolService.

The MySchoolService takes care of making use of MyRequest and MyResponse objects.

MyRequest takes care of doing HTTP requests (here with cURL) with cookies and such.

MyResponse helps a bit with parsing HTTP responses.
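To make that division of labour concrete, here is what a request wrapper of this kind might look like. This is only an illustrative sketch and not the MyRequest class from the gist: the class name SketchRequest, the options() method, and the default cookie file name are assumptions.

```php
<?php
// Hypothetical request wrapper; NOT the MyRequest class from the gist.
class SketchRequest
{
    private $url;
    private $cookieFile;
    private $postFields = null;

    public function __construct(string $url, string $cookieFile = 'cookies.txt')
    {
        $this->url = $url;
        $this->cookieFile = $cookieFile;
    }

    // Switches the request to POST; returns $this so calls can be chained.
    public function setPostFields(array $fields): self
    {
        $this->postFields = $fields;
        return $this;
    }

    // The cURL options this request would be sent with.
    public function options(): array
    {
        $options = [
            CURLOPT_URL            => $this->url,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_COOKIEJAR      => $this->cookieFile,  // cookies are saved here when the handle is released
            CURLOPT_COOKIEFILE     => $this->cookieFile,  // cookies saved by earlier requests are sent again
        ];
        if ($this->postFields !== null) {
            $options[CURLOPT_POST] = true;
            $options[CURLOPT_POSTFIELDS] = http_build_query($this->postFields);
        }
        return $options;
    }

    // Performs the request and returns the response body.
    public function execute(): string
    {
        $ch = curl_init();
        curl_setopt_array($ch, $this->options());
        $body = curl_exec($ch);
        if ($body === false) {
            throw new RuntimeException(curl_error($ch));
        }
        return $body;   // $ch goes out of scope here, which releases the handle
    }
}
```

Keeping the option building separate from execute() makes the class easy to check without touching the network.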

Compare this with a standard internet browser:

Browser: Handles cookies and sessions, does HTTP requests and parses responses.

MySchoolService: Handles cookies and sessions for your school, does HTTP requests and parses responses.

So you now have a school browser in your script that does what you want. If you need more options, you can easily extend it.

I hope this is helpful. The starting point was to avoid writing the same lines of cURL code over and over again, and to give you a better interface for parsing return values. The MySchoolService is some sugar on top that makes things easy to deal with in your own website / application code.
