
Parse a website using NodeJs

I'm trying to parse the website dl-protect: given a URL such as http://www.dl-protect.com/F469D615, the output should be the link it protects, for example an uptobox link.

I tried to figure out how this service works using the Chrome dev console.

First of all, there are two cases to consider:

  • No captcha is required and you just need to click the continue button. The NodeJs program should then return the URL (uptobox here) found on the second page.

  • A captcha is required. In this case the NodeJs program should return the URL of the captcha.

So far, here's my code (written in ES6):

import request from 'request';
import cheerio from 'cheerio';

// mimic the headers a browser would send
let options = {
  url: 'http://www.dl-protect.com/F469D615',
  headers: {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,en-US;q=0.8,en;q=0.6,fr-FR;q=0.4',
    'Cache-Control': 'max-age=0', 
    'Connection': 'keep-alive', 
    'Content-Type': 'application/x-www-form-urlencoded', 
    'Host': 'www.dl-protect.com', 
    'Origin': 'http://www.dl-protect.com', 
    'Referer': 'http://www.dl-protect.com/F469D615', 
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/49.0.2623.108 Chrome/49.0.2623.108 Safari/537.36'
  }
};

request.get(options, function (error, response, body) {
    if (!error && response.statusCode === 200) {
        // parse the body response with cheerio
        let $ = cheerio.load(body);

        // detect if a captcha is required
        let isCaptcha = !!$('#captcha').length;

        // url of the captcha if needed
        let captchaUrl = '';

        // display whether we need a captcha or not
        if (isCaptcha) {
            captchaUrl = $('#captcha').attr('src');
            console.log(`Captcha required, URL : ${captchaUrl}`);
        } else {
            console.log('No captcha required');
        }

        // get the key
        let formKey = $('form[name="ccerure"] input[name="key"]').attr('value');
        console.log(`key : ${formKey}`);

        // the "i" form field is computed by the page, so there is no need to scrape it
        // it only carries data about the browser, so I copied it once it had been generated
        let formIn = [
            '_UETCF0UJREfkVmbpZWZk5Wd7QXYtJ3bGBCduVWb1N2bEBSZsJWY0J3bQtj',
            'cldXZpZXLmRGctwWYuJXZ05Wa7IXZ3VWaWBiREBFItVXat9mcoNkJkVmbpZ',
            'WZk5Wd74CduVGdu92Yg8WZklmdv8WakVXYgwUTUhEIm9GIrNWYilXYsBHIy',
            '9mZgMXZz5WZjlGbgUmbpZXZkl2VgMXZsJWYuV0OvNnLyVGdwFGZh1GZjVmb',
            'pZXZkl2dilGb7UGb1R2bNBibvlGdwlncjVGRgQnblRnbvNEIl5Wa2VGZpdl',
            'JkVmbpZWZk5Wd7sTahpGall2ZmV2bo9mZvp2blFGciJmamN2Zk1mYmpGatt',
            'jcldXZpZFIGREUg0Wdp12byh2Q8ZzMuczM18SayFmZhNFI4ATMuMjM2IjLw',
            '4SO08SZt9mcoNEI4ATMuMjM2IjLw4SO08Sb1lWbvJHaDBSd05WdiVFIp82a',
            'jV2RgU2apxGIswUTUh0SoAiNz4yNzUzL0l2SiV2VlxGcwFEIpQjNfZDO4BC',
            'e15WaMByOxEDWoACMuUzLhxGbpp3bNxHNygHN0YDewMTN=='
        ].join('');

        // if no captcha
        if (!isCaptcha) {
            // override the initial options by adding the necessary form data
            options = Object.assign({}, options, {form: {key: formKey, i: formIn, submitform: 'Continuer'}});

            // reach the same page with a post containing the following data : key, i and submitform
            request.post(options, function (error, response, body) {
                console.log(body);
                // console.log(response);
                // console.log(error);
            });
        }
    }
});

When I look at the Chrome dev panel (Network tab with Preserve log enabled), as soon as I click the continue button, it shows me this:

[Image: Chrome developer panel]

I really thought passing "key", "i" and "submitform" would be enough, but it's not. It just goes back to the first page instead of going to the second page with the URL.

Any clue about how to get the uptobox link (in this case) as output would be really appreciated.

Thanks!

Most websites try to protect themselves against scraping. Their reasons vary and are their own, but typical protections include cookies and hidden form fields, each of which may be signed, timestamped and set to expire, and possibly even validated for single use on the backend.
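To illustrate the cookie part of this: the code in the question sends the POST without any cookies the first GET may have set, so the server has no way to tie the two requests together. Below is a minimal sketch of carrying cookies over by hand. `SimpleCookieJar` and `buildPostOptions` are hypothetical helpers written for this illustration, the cookie names are made up, and whether cookies alone are enough for this particular site is an assumption. The `request` library also documents a `jar: true` option that is meant to do this bookkeeping for you.

```javascript
// Minimal in-memory cookie jar (illustrative only: it ignores Path,
// Domain, Expires and the other cookie attributes).
class SimpleCookieJar {
  constructor() {
    this.cookies = new Map();
  }

  // Record the cookies found in an array of Set-Cookie header values.
  store(setCookieHeaders = []) {
    for (const header of setCookieHeaders) {
      const pair = header.split(';')[0]; // keep "name=value", drop attributes
      const eq = pair.indexOf('=');
      if (eq === -1) continue; // skip malformed entries
      this.cookies.set(pair.slice(0, eq).trim(), pair.slice(eq + 1).trim());
    }
  }

  // Build the value of the Cookie request header for the next request.
  header() {
    return [...this.cookies]
      .map(([name, value]) => `${name}=${value}`)
      .join('; ');
  }
}

// Build options for the follow-up POST: same URL and headers as the GET,
// plus the stored cookies and the form fields read from the first page.
function buildPostOptions(baseOptions, jar, formFields) {
  return Object.assign({}, baseOptions, {
    headers: Object.assign({}, baseOptions.headers, { Cookie: jar.header() }),
    form: formFields
  });
}

// Usage with made-up values. In real use the array passed to store()
// would be response.headers['set-cookie'] from the first GET.
const jar = new SimpleCookieJar();
jar.store(['PHPSESSID=abc123; path=/', 'token=xyz; HttpOnly']);

const postOptions = buildPostOptions(
  { url: 'http://www.dl-protect.com/F469D615', headers: { Referer: 'http://www.dl-protect.com/F469D615' } },
  jar,
  { key: 'value-read-from-the-page', submitform: 'Continuer' }
);

console.log(postOptions.headers.Cookie); // PHPSESSID=abc123; token=xyz
```

If the second page still does not come back with the cookies attached, the site is probably checking something else as well, such as a signed or single-use hidden field.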

What this site does specifically is anyone's guess; it is part of their internal security engineering.

So you are probably out of luck with simple crawling of the kind you are attempting, and you will need a full browser to do the work. Fortunately for you, there are headless browsers such as PhantomJS which may help.
