
How do I parse multiple pages?

I have been attempting to parse a site's table data into a JSON file, which I can do if I do each page one by one, but seeing as there are 415 pages that would take a while.

I have seen and read a lot of StackOverflow questions on this subject, but I don't seem able to modify my script so that it:

  1. Scrapes each page and extracts the 50 items (with item IDs) per page
  2. Does so in a rate-limited way so I don't negatively affect the server
  3. Waits until all requests are done so I can write each item + item ID to a JSON file.

I believe you should be able to do this using request-promise and Promise.all, but I cannot figure it out.

The actual scraping of the data is fine; I just cannot make the code scrape a page and then go on to the next URL with a delay or pause between requests. The code below is the closest I have got, but I get the same results multiple times and I cannot slow the request rate down.

Example of the page URLs:

  1. http://test.com/itemlist/1
  2. http://test.com/itemlist/2
  3. http://test.com/itemlist/3 etc. (up to 415)

    for (var i = 1; i <= noPages; i++) {
        urls.push({url: itemURL + i});
        console.log(itemURL + i);
    }

    Promise.map(urls, function(obj) {
        return rp(obj).then(function(body) {
            var $ = cheerio.load(body);
            //Some calculations again...
            rows = $('table tbody tr');
            $(rows).each(function(index, row) {
                var children = $(row).children();
                var itemName = children.eq(1).text().trim();
                var itemID = children.eq(2).text().trim();
                var itemObj = {
                    "id" : itemID,
                    "name" : itemName
                };
                itemArray.push(itemObj);
            });
            return itemArray;
        });
    }, {concurrency : 1}).then(function(results) {
        console.log(results);
        for (var i = 0; i < results.length; i++) {
            // access the result's body via results[i]
            //console.log(results[i]);
        }
    }, function(err) {
        // handle all your errors here
        console.log(err);
    });

Apologies if I have misunderstood node.js and its modules; I don't really use the language, but I needed to scrape some data and I really don't like Python.

Since you need requests to be run only one by one, Promise.all() would not help. A recursive promise (I'm not sure if that's the correct name) would.

function fetchAllPages(list) {
    if (!list || !list.length) return Promise.resolve(); // trivial exit
    var urlToFetch = list.pop();
    return fetchPage(urlToFetch).
        then(<wrapper that returns a Promise resolved after a delay>).
        then(function() {
            return fetchAllPages(list); // recursion!
        });
}

This code still lacks error handling. I also believe it can become much clearer with async/await:

for(let url of urls) {
    await fetchAndProcess(url);
    await <wrapper around setTimeout>;
}

but you need to find/write your own implementations of fetch() and setTimeout() that are async.
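
For reference, here is a minimal sketch of that idea, assuming a promise-based delay() helper and a fetchAndProcess() function you write yourself (both names are illustrative, not from any particular library):

// a setTimeout wrapped in a Promise so it can be awaited
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

// fetchAndProcess(url) is assumed to scrape one page and resolve with its items
async function fetchAllPagesSequentially(urls) {
    const allItems = [];
    for (const url of urls) {
        const items = await fetchAndProcess(url); // one page at a time
        allItems.push(...items);
        await delay(2000); // pause between requests to stay rate limited
    }
    return allItems;
}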

After input from @skyboyer suggesting recursive promises, I was led to a GitHub Gist called Sequential execution of Promises using reduce().

Firstly I created my array of URLs:

for (var i = 1; i <= noPages; i++) {
    //example urls[0] = "http://test.com/1"
    //example urls[1] = "http://test.com/2"
    urls.push(itemURL + i);
    console.log(itemURL + i);
}

Then:

var sequencePromise = urls.reduce(function(promise, url) {
    return promise.then(function(results) {
        // fetchIDsFromURL is async (it returns a promise in this case);
        // when the promise resolves I have my page data
        return fetchIDsFromURL(url)
            // wait before passing the data on, so requests are rate limited
            .then(itemArr => promiseWithDelay(9000).then(() => itemArr))
            .then(itemArr => {
                results.push(itemArr);
                // returning inside .then makes sure the data is passed on to the next step
                return results;
            });
    });
}, Promise.resolve([]));



// async: resolves with the scraped items once the page's table has been parsed
function fetchIDsFromURL(url)
{
  return new Promise(function(resolve, reject){
    request(url, function(err, res, body){
      if (err) return reject(err); // reject so request errors propagate out of the promise
      //console.log(body);
      var $ = cheerio.load(body);
      var rows = $('table tbody tr');
      $(rows).each(function(index, row) {
        var children = $(row).children();
        var itemName = children.eq(1).text().trim();
        var itemID = children.eq(2).text().trim();
        var itemObj = {
          "id" : itemID,
          "name" : itemName
        };
        //push the 50 items scraped per page into the (shared, outer) itemArray and
        //resolve with that array to send the data back from the promise
        itemArray.push(itemObj);
      });
      resolve(itemArray);
    });
  });
}

//returns a promise that resolves after the timeout
function promiseWithDelay(ms)
{
  return new Promise(function(resolve){
    setTimeout(resolve, ms);
  });
}

Then finally I call .then on the sequence of promises. The only issue I had with this was that results contained multiple arrays with the same data in each. Since all the data is the same in every array, I just take the first one, which has all my parsed items with IDs in it, and write it to a JSON file.

sequencePromise.then(function(results){
    var lastResult = results.length;
    console.log(results[0]);
    writeToFile(results[0]);
});
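
As an aside, one way to avoid the duplicated arrays would be to declare the item array locally inside fetchIDsFromURL, so each page resolves with only its own items, and then flatten the per-page arrays before writing. A rough, untested sketch of that variant:

// a local pageItems array per call, so results holds one array per page
// instead of the same shared itemArray repeated for every page
function fetchIDsFromURL(url) {
    return new Promise(function(resolve, reject) {
        request(url, function(err, res, body) {
            if (err) return reject(err);
            var $ = cheerio.load(body);
            var pageItems = []; // local to this page
            $('table tbody tr').each(function(index, row) {
                var children = $(row).children();
                pageItems.push({
                    "id" : children.eq(2).text().trim(),
                    "name" : children.eq(1).text().trim()
                });
            });
            resolve(pageItems);
        });
    });
}

// results is then an array of per-page arrays; flatten it before writing, e.g.:
// sequencePromise.then(results => writeToFile([].concat.apply([], results)));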
