How to access pages with querystring using request and node.js

I wrote a simple web scraper in Node.js, following an online tutorial, to gather info on BuzzFeed quizzes. It works fine for the main page ( https://www.buzzfeed.com/quizzes ), but when I try to use it on any of the other pages (e.g. https://www.buzzfeed.com/quizzes?page=4 ), I get no results. I'm not sure what's wrong. Here's my code:

var request = require('request');
var cheerio = require('cheerio');
var fs = require('fs');
var options = {
    method: 'GET',
    uri: 'https://www.buzzfeed.com/quizzes',
    qs: {
      page: 4
    }
}

request(options, function(error, response, html) {
    if(!error && response.statusCode == 200) {
      var $ = cheerio.load(html);

      $('div.card.js-feed-item').each(function( index ) { 
        var title = $(this).find('h2').text().trim();
        var link = $(this).find('a.link-gray').attr('href');
        var image = $(this).find('a.link-gray > div.js-progressive-image').attr('data-background-src');
        fs.appendFileSync('buzzfeed.txt', title + '\n' + link + '\n' + image + '\n\n');
      });
}});

Basically, if I comment out this:

qs: {
    page: 4
}

it works fine. Am I using qs wrong?

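Looking at the requests the page itself makes, you can actually just request this URL instead: "https://www.buzzfeed.com/quizzes?render_template=0". It gives you JSON with 2 fields: cards (an array of info) and nextPage (something like /quizzes?render_template=0&page=2). I think it holds the same data you are after.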
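A minimal sketch of following that JSON feed from one page to the next, assuming the response really has the shape described above. The sample payload and the `resolveNextPage` helper are made up for illustration. Only the two top-level field names (`cards`, `nextPage`) come from this answer, and the contents of each card are not known, so they are left empty here.

```javascript
// Hypothetical sample of what the render_template=0 endpoint might return.
// Only the field names "cards" and "nextPage" are taken from the answer.
var samplePayload = JSON.stringify({
    cards: [{}, {}],
    nextPage: '/quizzes?render_template=0&page=2'
});

// Hypothetical helper: parse the body and turn the relative nextPage
// value into an absolute URL, or return null when there is no next page.
function resolveNextPage(body, baseUrl) {
    var data = JSON.parse(body);
    if (!data.nextPage) {
        return null;
    }
    return new URL(data.nextPage, baseUrl).href;
}

var next = resolveNextPage(samplePayload, 'https://www.buzzfeed.com/quizzes');
console.log(next);
```

In a real scraper the body would come from the request callback rather than a hard-coded string, and the returned URL would be fed into the next request until the helper returns null.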

It looks like the BuzzFeed server wants to send back a compressed response. If you look at the documentation for the request module, you can find this option:

gzip - If true, add an Accept-Encoding header to request compressed content encodings from the server (if not already present) and decode supported content encodings in the response.

So in your case, just adding gzip: true to your options object should work. Be warned though: depending on how much the page relies on JS to show its content, the HTML might not be what you expect.
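For reference, this is roughly what the options object from the question looks like with that one extra property. It is a sketch of the suggested change only; the call to request stays the same as in the question and is shown as a comment so the snippet runs without the module installed.

```javascript
// Same options as in the question, plus gzip: true.
var options = {
    method: 'GET',
    uri: 'https://www.buzzfeed.com/quizzes',
    qs: {
        page: 4
    },
    gzip: true // ask for compressed content and have request decode it
};

// Then call request exactly as before:
// request(options, function(error, response, html) { ... });
console.log(options);
```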


How did I work this out? Basically, if you examine the returned response object (outside the if statement), you can get some pretty useful information.

For instance, we can check whether the qs option is working by looking at the request URL via response.request.url (or response.request.href ) and seeing (via console.log or a debugger) that it correctly formed the query string ( ?page=4 ), so that's not the issue.
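To have something to compare that logged URL against, the expected value can be built with Node's built-in URL class. This is not how the request module assembles the query string internally; it is only an independent way to produce the URL that qs: { page: 4 } should result in.

```javascript
// Build the URL we expect for qs: { page: 4 } using Node's global URL class.
var expected = new URL('https://www.buzzfeed.com/quizzes');
expected.searchParams.set('page', '4');

console.log(expected.href);

// Inside the request callback, this can be compared with the real value:
// console.log(response.request.href === expected.href);
```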

Digging further, we can see that response.statusCode is 500 and response.body (or the html param) is {"message": "INTERNAL_ERROR"} . This seems to indicate a "server error". However, we can visit the page just fine in a browser, so in reality it seems the server just doesn't like how we formed our request for some reason.
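A small sketch of that kind of inspection: a helper that summarises the callback arguments regardless of the status code, instead of silently skipping everything that is not a 200. The `summarise` function and the `fakeResponse` object are hypothetical; the fake object only mirrors the values reported above so the snippet runs without a network call.

```javascript
// Hypothetical helper: describe what came back, whatever the status code was.
function summarise(error, response, body) {
    if (error) {
        return 'request failed: ' + error.message;
    }
    return [
        'url: ' + response.request.href,
        'status: ' + response.statusCode,
        'body: ' + String(body).slice(0, 200) // first 200 characters only
    ].join('\n');
}

// Made-up stand-in for the object request passes to its callback.
var fakeResponse = {
    statusCode: 500,
    request: { href: 'https://www.buzzfeed.com/quizzes?page=4' }
};

var summary = summarise(null, fakeResponse, '{"message": "INTERNAL_ERROR"}');
console.log(summary);
```

In the real scraper the same function could be called as the first line of the request callback, before the if statement.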

At times like these it's worth checking response.headers , where we can see e.g. that content-type is application/json (which is clearly not what you want). More interestingly, there is a vary header where one of the values is Accept-Encoding . This is basically saying "if you make this request again with a different Accept-Encoding header, you may get a different response". Accept-Encoding is almost always used to specify the types of compression you can deal with, of which gzip is the one most commonly supported by servers, hence the gzip option provided by the Node request module. If you open the network tab of your browser devtools and browse to the URL, you can see the same header being set (in Chrome, filter requests by "Doc" to find it more easily).
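The header check can also be done in code. The sketch below assumes the headers object has lower-cased keys, which is how Node exposes response headers; the `variesOn` helper and the `fakeHeaders` object are hypothetical and only reproduce the two headers mentioned above.

```javascript
// Hypothetical helper: does the Vary header list the given request header?
function variesOn(headers, name) {
    var vary = headers['vary'] || '';
    var listed = vary.split(',').map(function (value) {
        return value.trim().toLowerCase();
    });
    return listed.indexOf(name.toLowerCase()) !== -1;
}

// Made-up stand-in for response.headers.
var fakeHeaders = {
    'content-type': 'application/json',
    'vary': 'Accept-Encoding'
};

console.log(variesOn(fakeHeaders, 'Accept-Encoding'));
console.log(variesOn(fakeHeaders, 'User-Agent'));
```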

Edit: your original code seems to be working for me now, so maybe it was a server issue after all.
