
How to access pages with a query string using request and Node.js

I wrote a simple web scraper in Node.js, following an online tutorial, to gather info on BuzzFeed quizzes. It works fine for the main page ( https://www.buzzfeed.com/quizzes ), but when I try it on any of the other pages (e.g. https://www.buzzfeed.com/quizzes?page=4 ), I get no results. I'm not sure what's wrong. Here's my code:

var request = require('request');
var cheerio = require('cheerio');
var fs = require('fs');
var options = {
    method: 'GET',
    uri: 'https://www.buzzfeed.com/quizzes',
    qs: {
      page: 4
    }
}

request(options, function(error, response, html) {
    if (!error && response.statusCode == 200) {
        var $ = cheerio.load(html);

        $('div.card.js-feed-item').each(function(index) {
            var title = $(this).find('h2').text().trim();
            var link = $(this).find('a.link-gray').attr('href');
            var image = $(this).find('a.link-gray > div.js-progressive-image').attr('data-background-src');
            fs.appendFileSync('buzzfeed.txt', title + '\n' + link + '\n' + image + '\n\n');
        });
    }
});

Basically, if I comment out this:

qs: {
    page: 4
}

it works fine. Am I using qs wrong?

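Looking at the requests the page makes, you can actually just scrape this URL instead: "https://www.buzzfeed.com/quizzes?render_template=0". It gives you JSON with 2 fields: cards (an array of info) and nextPage (something like /quizzes?render_template=0&page=2). I think it contains the same data, so you can use that.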
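As a rough sketch of how that JSON could be consumed: the helper below parses one response body and resolves nextPage against the site root. The function name parseQuizPage, the BASE_URL constant, and the sample body are made up for illustration; only the cards and nextPage field names come from the description above, and the shape of each card is not known here.

```javascript
// Hypothetical helper; the names here are illustrative, not part of any BuzzFeed API.
const BASE_URL = 'https://www.buzzfeed.com';

// Parse one JSON response body and work out where the next page lives.
// Assumes the body has the two fields described above: `cards` and `nextPage`.
function parseQuizPage(body) {
  const data = typeof body === 'string' ? JSON.parse(body) : body;
  const cards = Array.isArray(data.cards) ? data.cards : [];
  // nextPage is a relative path such as /quizzes?render_template=0&page=2
  const nextUrl = data.nextPage ? new URL(data.nextPage, BASE_URL).href : null;
  return { cards, nextUrl };
}

// Made-up body shaped like the description above (the card content is a placeholder).
const sample = '{"cards": [{"title": "placeholder"}], "nextPage": "/quizzes?render_template=0&page=2"}';
const page = parseQuizPage(sample);
console.log(page.cards.length, page.nextUrl);
```

With the request module you could pass json: true in the options so the body arrives already parsed, then keep requesting nextUrl until it comes back null.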

Looks like the BuzzFeed server wants to send back a compressed response. If you look at the documentation for the request module you can find this option:

gzip - If true , add an Accept-Encoding header to request compressed content encodings from the server (if not already present) and decode supported content encodings in the response.

So in your case, just adding gzip: true to your options object should work. Be warned though: depending on how much the page relies on JS to show its content, the HTML might not be what you expect.
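A minimal sketch of that change, using the options from the question. buildOptions and scrapePage are hypothetical names chosen for this example, and the selectors are copied from the question, so they may no longer match the live page.

```javascript
// Same options as in the question, plus gzip: true.
function buildOptions(page) {
  return {
    method: 'GET',
    uri: 'https://www.buzzfeed.com/quizzes',
    qs: { page: page },
    gzip: true // request compressed content and decode it in the response
  };
}

// Hypothetical wrapper; needs `npm install request cheerio` before it can run.
function scrapePage(page, callback) {
  const request = require('request');
  const cheerio = require('cheerio');
  request(buildOptions(page), function (error, response, html) {
    if (error) return callback(error);
    if (response.statusCode !== 200) {
      return callback(new Error('Unexpected status ' + response.statusCode));
    }
    const $ = cheerio.load(html);
    const titles = [];
    $('div.card.js-feed-item').each(function () {
      titles.push($(this).find('h2').text().trim());
    });
    callback(null, titles);
  });
}

// Usage (makes a real network request):
// scrapePage(4, function (err, titles) { console.log(err || titles); });
```

Reporting a non-200 status as an error, instead of silently skipping it as the original if statement does, makes a failure like this one visible straight away.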


How did I work this out? Well basically if you examine the returned response object (outside the if statement) you can get some pretty useful information.

For instance we can check if the qs option is working by checking the request url using response.request.url (or response.request.href ) and seeing (via console.log or a debugger) that it correctly formed the query string ( ?page=4 ), so that's not the issue.
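One way to make that check explicit is to build the URL you expect with Node's built-in URL class and compare it with what the request reports. expectedUrl is a hypothetical helper, and an exact string comparison is an assumption that could fail if parameters are ordered or encoded differently.

```javascript
// Build the URL we expect from a base uri plus a qs-style object.
function expectedUrl(uri, qs) {
  const url = new URL(uri);
  for (const [key, value] of Object.entries(qs)) {
    url.searchParams.set(key, String(value));
  }
  return url.href;
}

const expected = expectedUrl('https://www.buzzfeed.com/quizzes', { page: 4 });
console.log(expected); // https://www.buzzfeed.com/quizzes?page=4

// Inside the request callback you could then compare the two:
// console.log(response.request.href === expected);
```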

Digging further, we can see that response.statusCode is 500 and response.body (or the html param) is {"message": "INTERNAL_ERROR"} . This seems to indicate a "server error". However, we can visit the page just fine in our browser, so in reality it seems the server just doesn't like how we formed our request for some reason.

At times like these it's worth checking out response.headers , where we can see e.g. that content-type is application/json (which is clearly not what you want). More interestingly, there is a vary header where one of the values is Accept-Encoding. This is basically saying "if you make this request again with a different Accept-Encoding header, you may get a different response". Accept-Encoding is almost always used to specify the types of compression you can deal with, of which gzip is the most commonly supported by servers, hence the gzip option provided by the Node request module. If you open the network tab of your browser devtools and browse to the URL, you can see the same header being set (in Chrome, filter requests by "Doc" to find it more easily).
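The checks above can be collected into a small diagnostic helper. This is only a sketch: describeResponse and the fake response object are invented for illustration, with values taken from the description above. Node lowercases incoming header names, which is why the lookups use 'vary' and 'content-type'.

```javascript
// Hypothetical helper: summarize the parts of a response discussed above.
function describeResponse(response) {
  const headers = response.headers || {};
  // Vary is a comma-separated list of request header names.
  const vary = String(headers['vary'] || '')
    .split(',')
    .map(function (name) { return name.trim().toLowerCase(); })
    .filter(Boolean);
  return {
    statusCode: response.statusCode,
    contentType: headers['content-type'] || null,
    variesOnEncoding: vary.includes('accept-encoding')
  };
}

// Made-up response object shaped like the one described above.
const fakeResponse = {
  statusCode: 500,
  headers: { 'content-type': 'application/json', vary: 'Accept-Encoding' }
};
const summary = describeResponse(fakeResponse);
console.log(summary);
```

Calling describeResponse(response) at the top of the request callback, before the status check, would surface all three clues in one log line.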

Edit: your original code seems to be working for me now, so maybe it was a server issue after all.
