
Scraping data with X-ray. Multiple sub pages

I'm trying to scrape www.metacritic.com for some data to create a training module.

I'm able to use x-ray to scrape a single page, but this particular page has a lot of subpages (for the 'letter categories'). I've tried to loop through several letters and perform multiple scrape events, but I'm having trouble writing to a 'results.json' file using fs.appendFile.

I need a way to scrape THEN write to my file (it's currently just running both functions immediately).

var Xray = require('x-ray');
var xray = Xray({
  filters: {
    trim: function (value) {
      return typeof value === 'string' ? value.trim() : value
    },
    reverse: function (value) {
      return typeof value === 'string' ? value.split('').reverse().join('') : value
    },
    slice: function (value, start , end) {
      return typeof value === 'string' ? value.slice(start, end) : value
    }
  }
});

var request = require('request');
var fs = require('fs')
var letters = ['a','b','c','d']
var resultObj = []

function eraseFile() {
  fs.writeFile('results.json', '', function() {console.log('Erased')})
}

eraseFile();

for (i = 0; i < letters.length; i++) {
  xray('https://www.metacritic.com/browse/tv/title/all/' + letters[i], 'li.season_product', [{
    title: '.product_title | trim',
    score: '.metascore_w',
    url: 'a@href'
  }])
  .paginate('.flipper.next a@href')
  (function(err, obj) {
    if (err) { console.log(err) }
    resultObj.concat(obj)
  })
}

fs.appendFile('results.json', JSON.stringify(resultObj), function(err) {
  if (err) { console.log(err) }

  console.log('scraped data saved to results.json')
})

You are not using promises correctly. When you write to the file, the asynchronous code has not finished yet.
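To make the timing problem concrete, here is a minimal sketch. It uses a hypothetical `fakeScrape` helper (a `setTimeout` stand-in for an x-ray request, not part of the library) to show that code placed after the loop runs before any scrape callback fires. It also shows that `concat` returns a new array instead of modifying the one it is called on, so a bare `resultObj.concat(obj)` throws its result away.

```javascript
// Hypothetical stand-in for an x-ray request: calls back later, like a network call would.
function fakeScrape(letter, callback) {
  setTimeout(function () {
    callback(null, [{ title: 'show-' + letter }]);
  }, 10);
}

var discarded = []; // updated the way the question does it
var kept = [];      // updated by reassigning the result of concat
var log = [];

fakeScrape('a', function (err, obj) {
  if (err) { return console.log(err); }
  discarded.concat(obj);   // returns a new array; `discarded` itself stays empty
  kept = kept.concat(obj); // reassigning keeps the scraped items
  log.push('callback: discarded.length = ' + discarded.length +
           ', kept.length = ' + kept.length);
  console.log(log.join('\n'));
});

// Runs before the callback above, so nothing has been collected yet.
log.push('sync: kept.length = ' + kept.length);
```

The synchronous line is logged first with a length of 0, which is the state `fs.appendFile` sees in the question's code.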

You could drop the single resultObj and append each result to the file as it is received. A problem that still exists is that you are bombarding the site with requests, and the site may block you or see your requests as a DDoS attack. I can provide a throttled example if you need it, but without throttling the code would look something like this:

var Xray = require('x-ray');
var xray = Xray({
  filters: {
    trim: function (value) {
      return typeof value === 'string' ? value.trim() : value
    },
    reverse: function (value) {
      return typeof value === 'string' ? value.split('').reverse().join('') : value
    },
    slice: function (value, start , end) {
      return typeof value === 'string' ? value.slice(start, end) : value
    }
  }
});

var request = require('request');
var fs = require('fs')
var letters = ['a','b','c','d']

function eraseFile() {
  // Truncate synchronously so the file is empty before any append can run
  fs.writeFileSync('results.json', '')
  console.log('Erased')
}

eraseFile();

const makeXrayRequestFunction =
  letter =>
  () =>
    xray('https://www.metacritic.com/browse/tv/title/all/' + letter, 'li.season_product', [{
      title: '.product_title | trim',
      score: '.metascore_w',
      url: 'a@href'
    }])
    .paginate('.flipper.next a@href')
;
const handleXrayFinishedRequest =
  obj =>
    //resultObj.concat(obj);//doing nothing with obj
    new Promise(//append single result to file 
      (resolve, reject) => {
        fs.appendFile(
          'results.json'
          , JSON.stringify(obj)
          , err =>
            err? 
              reject(err) //could not write to file, return rejected promise
              : resolve(obj) //could write to file, return obj
        )
      }
    )
;
const failedXrayRequest = //just log the error when failed
  err =>
    console.log("failed:",err)
;
Promise.all(
  letters.map( //map letters array to functions that when called will have xray make the request
    makeXrayRequestFunction
  )
  .map(
    xRayFunction =>
      xRayFunction() //call the xray function, this should return a promise
      .then(
        handleXrayFinishedRequest //xray request was successful, try to append to file
      )
      .then(
        undefined
        ,failedXrayRequest //either xray or file writing failed, handle it
      )
    )
)
.then(
  resultObj =>
    console.log("Finished scraping")
)
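For the throttling concern mentioned above, one possible approach is to run the requests one at a time with a pause between them. The sketch below is only an illustration: `delay` and `runThrottled` are hypothetical helper names, and each request function is assumed to return a promise (or thenable) that resolves to an array of items.

```javascript
// Pause for `ms` milliseconds.
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

// Run the request functions one at a time, waiting `pauseMs` after each one.
// A failed request is logged and skipped so it does not stop the remaining ones.
async function runThrottled(requestFunctions, pauseMs) {
  const collected = [];
  for (const makeRequest of requestFunctions) {
    try {
      const items = await makeRequest(); // wait for this request before starting the next
      collected.push(...items);
    } catch (err) {
      console.log('failed:', err);
    }
    await delay(pauseMs);
  }
  return collected;
}
```

With the code above this could be called as `runThrottled(letters.map(makeXrayRequestFunction), 1000)`, assuming the value returned by x-ray can be awaited through its `.then()` method. Because all items end up in one array, they can be written once with `fs.writeFile('results.json', JSON.stringify(items), callback)`, which yields a single valid JSON array. Appending several stringified arrays to the same file produces back-to-back arrays that `JSON.parse` will not accept as one document.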

