
Node.js web-scraping

I'm trying to scrape a page to get a link and some text from a paragraph. But for some reason my code doesn't work. I have tried a lot, and every time it just gives me undefined.

var request = require('request');
var cheerio = require('cheerio');

request('https://bitskins.com', function (error, response, html) {
  if (!error && response.statusCode == 200) {
    var $ = cheerio.load(html);
    $('p', '.chat-box-content').each(function(i, element){
        if($(this).attr('style') == 'height: 15px;'){
            console.log($(this));
        }
    });
  }
});

https://gyazo.com/b80465474a389657c44aeeb64888a006

I only want it to return the second and third lines, i.e. the link and the price. What do I have to do? I'm new and I'm lost.

The problem is that when you request the page, the chat box is in a collapsed/hidden state, and all the <p> elements (which are apparently placeholders) are empty. If you open the chat box, some JavaScript on the page runs and populates the list.

Fortunately, you don't need to scrape the screen at all. The page invokes an API to populate the list. You can just call the API yourself.

var request = require('request');

request.post('https://bitskins.com/api/v1/get_last_chat_messages', function (error, response, data) {
  if (!error && response.statusCode == 200) {
      var dataObject = JSON.parse(data);
      dataObject.data.messages.forEach(function (message) {
          // For some reason the message is JSON encoded as a string...
          var messageObject = JSON.parse(message);
          // The message object has "message" field.
          // Just use a regex to parse out the link and the price.
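          var linkMatch = messageObject.message.match(/href='([^']+)/);
          var priceMatch = messageObject.message.match(/\$(\d+\.\d+)/);
          // Skip messages that don't contain both a link and a price;
          // match() returns null in that case and indexing it would throw.
          if (!linkMatch || !priceMatch) return;
          console.log(linkMatch[1] + " " + priceMatch[1]);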
      });
  }
});

You will probably want to add better error handling, convert the price into a number, etc.
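As a rough sketch of those two suggestions, the per-message parsing could be pulled into a small helper that returns null instead of throwing and converts the price to a number. The name parseChatMessage and the sample message below are made up for illustration; the real API payload may look different, so treat this as a starting point rather than a finished solution.

```javascript
// Hypothetical helper (not part of the original answer): extract
// { link, price } from one chat entry. Returns null when the entry
// can't be decoded or doesn't contain both a link and a price.
function parseChatMessage(rawMessage) {
  var messageObject;
  try {
    // Each entry is itself a JSON-encoded string, so decode it first.
    messageObject = JSON.parse(rawMessage);
  } catch (e) {
    return null;
  }
  if (!messageObject || typeof messageObject.message !== 'string') {
    return null;
  }
  // Same regexes as above; match() returns null when nothing matches.
  var linkMatch = messageObject.message.match(/href='([^']+)/);
  var priceMatch = messageObject.message.match(/\$(\d+\.\d+)/);
  if (!linkMatch || !priceMatch) {
    return null;
  }
  // parseFloat turns the captured "12.34" into the number 12.34.
  return { link: linkMatch[1], price: parseFloat(priceMatch[1]) };
}

// Made-up sample entry, shaped the way the code above assumes.
var sample = JSON.stringify({
  message: "<a href='https://example.com/item/1'>Some item</a> for $12.34"
});

console.log(parseChatMessage(sample));
console.log(parseChatMessage('not json'));
```

Inside the forEach you would then call parseChatMessage(message) and skip any entry for which it returns null.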
