

Problems with Node.js web scraping | request | cheerio

I am writing a fairly simple web scraper using Node.js with the request and cheerio modules. My code doesn't work as I want it to, for two reasons:

  1. When trying to scrape the image URL, I am only returned a single URL, repeated multiple times for each page.
  2. The iteration over each 'href' and 'title' happens in a seemingly random order (it is the same order each time, but still not sequential, e.g. 1, 2, 3, etc.).

Here is my code:

var request = require('request'),
    cheerio = require('cheerio');

var sqlite3 = require('sqlite3').verbose();
var database = "storage.db"
console.log('[+] Creating database: ' + database);
var db = new sqlite3.Database(database);

var pw_url = "https://primewire.unblocked.ink"

console.log('[+] Creating table with rows...');
db.serialize(function() {
  db.run("CREATE TABLE IF NOT EXISTS main (title TEXT, film_page_links TEXT, img_url TEXT)");
});

var img_urls = {}

function iter_pages(page_number) {
  request(pw_url + '/index.php?sort=featured&page=' + page_number, function(err, resp, body) {
    if(!err && resp.statusCode == 200) {
      console.log('[+] The request response status code is: ' + resp.statusCode);
      var $ = cheerio.load(body);
      console.log('[+] Inserting values into database.');
      $('.index_item a img', '.index_container').each(function() {
        img_urls.img_url = $(this).attr('src');
      });
      $('.index_item a', '.index_container').each(function() {
        var url = $(this).attr('href');
        var title = $(this).attr('title');
        if(url.startsWith('/watch-')) {
          //urls.push('https://primewire.unblocked.ink' + url);
          db.run("INSERT INTO main (title, film_page_links, img_url) VALUES (?, ?, ?)",
                  title.replace("Watch ", ""),
                  pw_url + url,
                  "https:" + img_urls.img_url);
        };
      });
      console.log('[+] Processed page:' + page_number);
    }
  });
}

for (var i = 1; i < 5; i++) {
    iter_pages(i);
}

Here is my console.log:

[+] Creating database: storage.db
[+] Creating table with rows...
[+] The request response status code is: 200
[+] Inserting values into database.
[+] Processed page:4
[+] The request response status code is: 200
[+] Inserting values into database.
[+] Processed page:1
[+] The request response status code is: 200
[+] Inserting values into database.
[+] Processed page:3
[+] The request response status code is: 200
[+] Inserting values into database.
[+] Processed page:2

As you can see, it goes in the order 4, 1, 3, 2, which confuses me.

The image URL it returns is always the 21st item of each page.

I am new to JavaScript, so please be kind. I have tried moving the code that fetches the image URL around within the iter_pages function, which either breaks the code or returns the same thing.

Even a link to a more advanced tutorial would suffice. I learn quickly, but all the tutorials I have found cover only very basic techniques.

First problem:

This is how you set the image URL: img_urls.img_url = ... .

What's happening is that every time you set it, you write to the same property and overwrite what was there, which is why you always end up with the last one from the page. You could try to fix it by pushing into an array, but since you have two loops, that makes things much more complicated. Instead, do both in the same loop:

$('.index_item a', '.index_container').each(function() {
  var url = $(this).attr('href');
  var title = $(this).attr('title');
  var img_url = $('img', this).attr('src');
  if(url.startsWith('/watch-')) {
    db.run("INSERT INTO main (title, film_page_links, img_url) VALUES (?, ?, ?)",
           title.replace("Watch ", ""),
           pw_url + url,
           "https:" + img_url);
  }
});
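To see the first bug in isolation, here is a minimal sketch (separate from the scraper, with hard-coded values standing in for the scraped src attributes):

```javascript
// Minimal illustration of the overwrite bug: assigning every value to the
// same object property keeps only the last assignment.
var img_urls = {};

['a.jpg', 'b.jpg', 'c.jpg'].forEach(function (src) {
  img_urls.img_url = src; // each iteration overwrites the previous value
});

console.log(img_urls.img_url); // 'c.jpg': only the last src survives
```

This is exactly what the inner `.each()` over the images was doing, which is why every row got the same (last) image URL.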

Second problem:

You have to realize a couple of things. request(...) makes an asynchronous network request. That means the function returns immediately, before the result has arrived. So the loop keeps going and all the network requests start at roughly the same time, but then, thanks to many different variables and some luck, they finish at different times. Some might be faster, some slower. Since they were all started at nearly the same moment, the order in which they were started doesn't matter much. Here's your problem, simplified:

const request = require('request');

for (let i = 0; i < 5; i++) { 
  makeRequest(i);
}

function makeRequest(i) {
  console.log('Starting', i);
  console.time(i);
  request('http://google.com', () => console.timeEnd(i));
}

And here are the logs:

$ node a.js
Starting 0
Starting 1
Starting 2
Starting 3
Starting 4
1: 8176.111ms
2: 8176.445ms
3: 8206.300ms
0: 8597.458ms
4: 9112.237ms

Running it again yields this:

$ node a.js
Starting 0
Starting 1
Starting 2
Starting 3
Starting 4
3: 8255.378ms
1: 8260.633ms
2: 8259.134ms
0: 8268.859ms
4: 9230.929ms

So you can see the order is not deterministic; some requests simply finish faster than others.

If you really want them to happen in order, I suggest using a control-flow library. async.js is one of the most popular ones.
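If you'd rather not pull in a library, the same ordering can be sketched with a plain recursive callback chain: start page n+1 only inside page n's callback. In this sketch, processPage is a hypothetical stand-in for the real request(...) call, with setTimeout simulating random network latency:

```javascript
// Stand-in for the real request(...) call: invokes its callback
// after a random delay, the way a network request would.
function processPage(pageNumber, done) {
  setTimeout(function () {
    done(null, 'page ' + pageNumber);
  }, Math.random() * 50);
}

var results = [];

// Start the next page only after the previous one has finished,
// so results arrive in order regardless of per-request latency.
function iterPagesInOrder(pageNumber, lastPage, finished) {
  if (pageNumber > lastPage) return finished(results);
  processPage(pageNumber, function (err, result) {
    if (err) return finished(results);
    results.push(result);
    iterPagesInOrder(pageNumber + 1, lastPage, finished);
  });
}

iterPagesInOrder(1, 4, function (all) {
  console.log(all); // pages logged in order 1 through 4
});
```

The trade-off is that pages are now fetched one at a time. async.js's eachSeries gives you the same ordering, and mapLimit lets you keep some parallelism while bounding how many requests run at once.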
