
How do I extract JavaScript from within HTML?

I am writing a web scraper in JavaScript, using request and cheerio. The page I'm scraping contains JavaScript embedded in the HTML. That JavaScript is what I'm interested in, but I can't find a way to access it. Is there a way to extract it using cheerio?

Many thanks for any suggestions; I've just started with web scraping.

My code is:

var request = require('request');
var cheerio = require('cheerio');

var credentials = {
    username: 'username',
    password: 'password'
};

request.post({
    uri: 'http://webpage',
    headers: { 'content-type': 'application/x-www-form-urlencoded' },
    body: require('querystring').stringify(credentials)
}, function(err, res, body){
    if(err) {
        callback.call(null, new Error('Login failed'));
        return;
    }

    request('http://webpage', function(err, res, body) {
        if(err) {
            callback.call(null, new Error('Request failed'));
            return;
        }

        var $ = cheerio.load(body);
        var text = $('#element').text();
        console.log($.html());
    });
});

If you're looking for the JavaScript inside the webpage, you can use cheerio to select all <script> tags in the HTML and then read the content of each one.

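var request = require('request');
var cheerio = require('cheerio');

var scripts = [];

request('http://webpage', function(err, res, body) {
  if (err) {
    // callback is assumed to be defined by the surrounding code, as in the question
    callback.call(null, new Error('Request failed'));
    return;
  }

  var $ = cheerio.load(body);
  // Store the inline content of every <script> tag, in document order
  $('script').each(function(i, element) {
    scripts[i] = $(element).text();
  });
});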

Once the request completes, the scripts array holds the content of every inline script in the HTML. Scripts that are imported through a src attribute have no inline content, so you won't get anything for them. You can check whether an element has a src URL:

...

$('script').each(function(i, element) {
  if ($(element).attr('src') === undefined) {
    // Inline script: keep its content
    scripts[i] = $(element).text();
  }
  else {
    // External script: collect or ignore this.
  }
});

...

I haven't tested this, but it should work based on cheerio's documentation.

https://github.com/cheeriojs/cheerio
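
If you decide to collect the external scripts instead of ignoring them, their src values are often relative to the page, so they need to be resolved before you can request them. The following is only a rough sketch using Node's built-in URL class; the helper name resolveScriptUrls and the example.com addresses are made up for illustration, and the src values are assumed to come from $(element).attr('src') in the else branch above.

```javascript
// Sketch: turn the src values of external <script> tags into absolute URLs.
// resolveScriptUrls is a hypothetical helper, not part of cheerio or request.
function resolveScriptUrls(srcValues, pageUrl) {
  return srcValues
    // Skip inline scripts (src is undefined) and empty src attributes
    .filter(function (src) {
      return typeof src === 'string' && src.trim() !== '';
    })
    // Resolve each src against the URL of the page it was found on
    .map(function (src) {
      return new URL(src.trim(), pageUrl).href;
    });
}

// Example with made-up values
var urls = resolveScriptUrls(
  ['/js/app.js', 'lib/util.js', 'https://cdn.example.com/x.js', undefined],
  'http://example.com/pages/index.html'
);
console.log(urls);
```

Each resulting URL could then be passed to request in the same way as the page itself to download the script source.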


 