Parse a website using NodeJs

I'm trying to parse the dl-protect website: given a URL such as http://www.dl-protect.com/F469D615, the program should directly output the link hidden behind it, for example an uptobox link.

I tried to figure out how this service works using the Chrome dev console.

First of all, there are two cases to consider:

  • You don't need to enter a captcha, you just need to click on the continue button. The NodeJs program should then return the URL (uptobox here) found on the second page.

  • You need to enter a captcha. In this case the NodeJs program should return the URL of the captcha.

So far, here's my code (written in ES6):

import request from 'request';
import cheerio from 'cheerio';

// try to send the same headers as if the request were coming from a browser
let options = {
  url: 'http://www.dl-protect.com/F469D615',
  headers: {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,en-US;q=0.8,en;q=0.6,fr-FR;q=0.4',
    'Cache-Control': 'max-age=0', 
    'Connection': 'keep-alive', 
    'Content-Type': 'application/x-www-form-urlencoded', 
    'Host': 'www.dl-protect.com', 
    'Origin': 'http://www.dl-protect.com', 
    'Referer': 'http://www.dl-protect.com/F469D615', 
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/49.0.2623.108 Chrome/49.0.2623.108 Safari/537.36'
  }
};

request.get(options, function (error, response, body) {
    if (!error && response.statusCode == 200) {
        // parse the body response with cheerio
        let $ = cheerio.load(body);

        // detect if a captcha is required
        let isCaptcha = !!$('#captcha').length;

        // url of the captcha if needed
        let captchaUrl = '';

        // display whether we need a captcha or not
        if (isCaptcha) {
            captchaUrl = $('#captcha').attr('src');
            console.log(`Captcha required, URL : ${captchaUrl}`);
        } else {
            console.log('No captcha required');
        }

        // get the key
        let formKey = $('form[name="ccerure"] input[name="key"]').attr('value');
        console.log(`key : ${formKey}`);

        // set the form as it's computed no need to get it
        // this param is just data about the browser so I ended up copying it once it was generated
        let formIn = [ /* browser-data values copied from a real session; elided in the original post */ ];

        // if no captcha
        if (!isCaptcha) {
            // override the initial options by adding the necessary form data
            options = Object.assign({}, options, {form: {key: formKey, i: formIn, submitform: 'Continuer'}});

            // reach the same page with a post containing the following data : key, i and submitform
            request.post(options, function (error, response, body) {
                // console.log(response);
                // console.log(error);
            });
        }
    }
});

When I look at the Chrome dev panel (Network tab with "Preserve log" enabled), as soon as I click on the continue button, it shows me this: [screenshot not included]


I really thought passing "key", "i" and "submitform" would be enough, but it's not. The POST just returns the first page again instead of going to the second page with the URL.

Any clue about how to get the uptobox link (in this case) as output would be really appreciated.

Thanks!

Most websites try to protect themselves against people scraping them; their reasons vary and are their own. The typical means of protection are cookies, hidden form fields and the like, each of which may be signed, timestamped and set to expire, and possibly even validated for single use in the backend.

What this site does specifically is anyone's guess, and is part of its internal security engineering.
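As a generic illustration of the cookie and hidden-field point, here is a minimal sketch of carrying both from the first GET response into the follow-up POST. The helper names (`cookieHeaderFrom`, `hiddenFields`, `buildPost`) are made up for this example, the regex-based field extraction is a simplification (cheerio would be more robust), and nothing here is known to match what dl-protect actually checks.

```javascript
// Hypothetical helpers for illustration only; they use nothing beyond Node.js built-ins.

// Turn the Set-Cookie headers of a response into a Cookie request header.
// Only the name=value pair is kept; attributes such as Path or Expires are dropped.
function cookieHeaderFrom(setCookieHeaders) {
  return (setCookieHeaders || [])
    .map((header) => header.split(';')[0].trim())
    .filter((pair) => pair.includes('='))
    .join('; ');
}

// Collect every <input type="hidden"> of a page into a { name: value } object.
// Assumes quoted attribute values; a real HTML parser handles more cases.
function hiddenFields(html) {
  const fields = {};
  const inputs = html.match(/<input\b[^>]*>/gi) || [];
  for (const tag of inputs) {
    if (!/\btype\s*=\s*["']?hidden\b/i.test(tag)) continue;
    const name = /\bname\s*=\s*["']([^"']*)["']/i.exec(tag);
    const value = /\bvalue\s*=\s*["']([^"']*)["']/i.exec(tag);
    if (name) fields[name[1]] = value ? value[1] : '';
  }
  return fields;
}

// Build the headers and urlencoded body of the follow-up POST:
// all hidden fields of the page, plus any extra fields (e.g. the submit button).
function buildPost(html, setCookieHeaders, extraFields) {
  const form = Object.assign({}, hiddenFields(html), extraFields);
  return {
    headers: {
      'Content-Type': 'application/x-www-form-urlencoded',
      'Cookie': cookieHeaderFrom(setCookieHeaders),
    },
    body: new URLSearchParams(form).toString(),
  };
}
```

With the `request` library used in the question, passing `jar: true` in the options (or using `request.defaults({jar: true})`) should take care of the cookie part automatically. Whether cookies and hidden fields alone are enough for this particular site is not something this sketch can promise.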

So you are probably out of luck with simple crawling like what you are trying to do, and you will need a full browser to do the work. Fortunately (for you) there are headless browsers such as PhantomJs which may be of help.
