
Issues with request and cheerio when web-scraping

I'm trying to write code that makes requests to a website, for web scraping.

These are the steps:

First part of the code STARTS

  1. The program makes a request to the mainURL.
  2. The program selects some elements from the HTML of the mainURL and stores them in an array of objects (advert). One of the properties of each object is its link, which we'll call numberURL; the code selects it automatically using a CSS selector. There are around 80-90 objects.
  3. The program makes a request to every numberURL (80-90 requests). For each of them it sets further properties on the same object and selects another link, which we'll call accountURL.
  4. The program creates a CSV file and writes each object to its own row.

First part of the code ENDS
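Step 4 above writes each object as one CSV row. Scraped titles and descriptions can contain commas, quotes, or line breaks, so a row formatter usually needs to quote fields. The following is only a minimal sketch, assuming RFC 4180-style quoting; csvField, csvRow, and the sample advert (including its ad URL) are hypothetical and not part of the original code.

```javascript
// Quote a single CSV field: wrap it in double quotes when it contains a
// comma, a quote, or a line break, and double any embedded quotes.
function csvField(value) {
  const text = String(value ?? '');
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Build one CSV row from an object, taking the columns in the given order.
function csvRow(record, columns) {
  return columns.map((name) => csvField(record[name])).join(',') + '\n';
}

// Hypothetical advert, used only for illustration.
const sampleAd = { id: 1, title: 'BMW 5, 2010', link: 'https://999.md/ru/1', phone: '' };
const row = csvRow(sampleAd, ['id', 'title', 'link', 'phone']);
```

Passing the column list explicitly keeps the header line and the row values in the same order, which is easy to get wrong when both are written by hand.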

The first part actually works pretty well and doesn't have any issues, but the second part does.

Second part of the code STARTS

  1. The program makes a request to every accountURL from the previous objects.
  2. The program selects some elements from the HTML of the accountURL and stores them in another array of different objects (account), also using CSS selectors.
  3. The program should console.log() all the account objects.

Second part of the code ENDS

But the second part has some bugs: when I console.log the objects, their properties still have their default values.

So, for debugging purposes, I took one advert object and set its link manually in the code:

post[0].link = 'https://999.md/ru/profile/denisserj'

When I run the code, it works correctly for this object and shows the changed properties, but for the rest of them it doesn't.

I tried setting some timeouts, thinking that the code tries to read the link before the second request has finished, but it had no effect.
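Fixed timeouts tend to be fragile with callback-based requests, because responses can arrive in any order. The listing in this question starts the next stage when the callback for the last index runs (i == postNumber), which is not necessarily the last callback to finish. A common alternative is to count completions. The sketch below is only an illustration of that pattern: fakeRequest is a stand-in that simulates latency with setTimeout, and fetchAll and the job list are hypothetical names, not part of the original code.

```javascript
// Stand-in for a network call: invokes the callback after `delayMs`.
// (Hypothetical helper; the real code would call request() here.)
function fakeRequest(url, delayMs, callback) {
  setTimeout(() => callback(null, `body of ${url}`), delayMs);
}

// Start one request per job and call `done` only after every callback has fired.
function fetchAll(jobs, done) {
  const results = new Array(jobs.length);
  let pending = jobs.length;
  if (pending === 0) return done(results);
  jobs.forEach((job, i) => {
    fakeRequest(job.url, job.delayMs, (error, body) => {
      results[i] = error ? null : body; // keep results in job order
      pending -= 1;                     // count completions, not indexes
      if (pending === 0) done(results);
    });
  });
}

// The last job finishes first here, so a check on the last index would fire early.
const jobs = [
  { url: 'a', delayMs: 30 },
  { url: 'b', delayMs: 20 },
  { url: 'c', delayMs: 5 },
];
const allDone = new Promise((resolve) => fetchAll(jobs, resolve));
```

With this shape, the follow-up step (writing the file, starting the account requests) runs exactly once, after all responses are in, regardless of their order.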

I also tried to console.log the link to check that it exists in the array. It does exist there, but that didn't help either.

Finally, here is the code:

// CLASSES
class advert {
    constructor() {
        this.id = 0;
        this.title = new String();
        this.link = new String();
        this.phone = new String();
        this.account = new String();
        this.accountLink = new String();
        this.text = new String();
        this.operator = new String();
    }
    show() {
        console.log(this.id, this.title, this.link, this.phone, this.account, this.accountLink, this.text, this.operator);
    }

}
class account {
    constructor() {
        this.name = 0;
        this.createdAt = 0;
        this.phone = [];
        this.ads = [];
        this.adsNumber = 0;
    }
    show() {
        console.log(this.name, this.createdAt, this.phone, this.ads, this.adsNumber);
    }
}

// HEADERS
const mainRequest = require('request');
const auxRequest = require('request');
const cheerio1 = require('cheerio');
const cheerio2 = require('cheerio');
const fs = require('fs');
const fs2 = require('fs');
const adFile = fs.createWriteStream('anunturi.csv');
const accFile = fs2.createWriteStream('conturi.csv');

// SETTINGS
const host = 'https://999.md'
const category = 'https://999.md/ru/list/transport/cars'
const timeLimit = 60; //seconds

// VARIABLES
let post = [];
let postNumber = 0;
let acc = [];

// FUNCTIONS
function deleteFromArray(j) {
    post.splice(j, 1);
}

function number(i) {
    let category = post[i].link;
    auxRequest(category, (error, response, html) => {
        if (!error && response.statusCode == 200) {
            const $ = cheerio1.load(html);
            let phone;
            const siteTitle = $('strong').each((id, el) => {
                phone = $(el).text();
            });
            const txt = $('.adPage__content__description').html();
            const person = $('.adPage__header__stats').find('.adPage__header__stats__owner').text();
            const linkToPerson = host + $('.adPage__header__stats').find('.adPage__header__stats__owner').find('a').attr('href');
            post[i].phone = phone;
            post[i].account = person;
            post[i].accountLink = linkToPerson;
            post[i].text = txt;
            if (i == postNumber) {
                console.log('1. Number Putting done')
                writeToFileAd(accountPutter, writeToFileAccount);
            }
        }

    });
}

function writeToFileAd() {
    adFile.write('ID, Titlu, Link, Text, Cont, LinkCont, Operator\n')
    for (let i = 0; i <= postNumber; i++) {
        adFile.write(`${post[i].id}, ${post[i].title}, ${post[i].link}, ${post[i].phone}, ${post[i].account}, ${post[i].accountLink}, ${post[i].operator}\n`);
    }
    console.log('2. Write To File Ad done')
    accountPutter();
}

function accountAnalyzis(i) {
    let category = post[i].link;
    const mainRequest = require('request');
    category = category.replace('/ru/', '/ro/');
    mainRequest(category, (error, response, html) => {

        if (!error && response.statusCode == 200) {
            const $ = cheerio2.load(html);
            const name = $('.user-profile__sidebar-info__main-wrapper').find('.login-wrapper').text();
            let createdAt = $('.date-registration').text();
            createdAt = createdAt.replace('Pe site din ', '');
            const phones = $('.user-profile__info__data').find('dd').each((id, el) => {
                let phone = $(el).text();
                acc[i].phone.push(phone);
            });
            const ads = $('.profile-ads-list-photo-item-title').find('a').each((id, el) => {
                let ad = host + $(el).attr('href');
                acc[i].ads.push(ad);
                acc[i].adsNumber++;
            });
            acc[i].name = name;
            acc[i].createdAt = createdAt;
            console.log(name)
            if (i == postNumber) {
                console.log('3. Account Putting done')
                writeToFileAccount();
            }
        }
    });
}

function writeToFileAccount() {
    for (let i = 0; i <= postNumber; i++) {
        accFile.write(`${acc[i].name}, ${acc[i].createdAt}, ${acc[i].phone}, ${acc[i].ads}, ${acc[i].adsNumber}\n`);
    }
    console.log('4. Write to file Account done');
}

function numberPutter() {
    for (let i = 0; i <= postNumber; i++) {
        number(i);
    }
}

function accountPutter() {
    for (let i = 0; i <= postNumber; i++) {
        accountAnalyzis(i);
    }
}

// MAIN
mainRequest(category, (error, response, html) => {
    let links = [];
    for (let i = 0; i < 1000; i++) {
        post[i] = new advert();
    }
    for (let i = 0; i < 1000; i++) {
        acc[i] = new account();
    }
    if (!error && response.statusCode == 200) {
        const $ = cheerio2.load(html);
        const siteTitle = $('.ads-list-photo-item-title').each((id, el) => {
            const ref = host + $(el).children().attr('href');
            const title = $(el).text();
            post[id].id = id + 1;
            post[id].title = title;
            post[id].link = ref;
            links[id] = ref;
            postNumber = id;
        });
        post[0].link = 'https://999.md/ru/profile/denisserj'
        numberPutter()
    }

});

You have an error in this line:

const siteTitle = $('.ads-list-photo-item-title').each((id, el) => {

What you actually want is .find('a').each...
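A further observation from the listing, offered as an inference rather than something confirmed in this thread: accountAnalyzis builds its URL from post[i].link (the ad page), while number stores the profile URL in post[i].accountLink. That would be consistent with only the manually overwritten post[0] producing account data. Below is a minimal sketch of reading the profile link instead. accountUrlFor is a hypothetical helper, the ad URL in the sample is made up, and the '/profile/' check is an assumption based on the single profile URL shown in the question.

```javascript
// Pick the profile URL for an advert and switch it to the Romanian locale,
// mirroring the '/ru/' -> '/ro/' replacement done in accountAnalyzis.
// Returns null when no profile link was scraped for this advert.
function accountUrlFor(advert) {
  const link = String(advert.accountLink || '');
  // Assumption: profile URLs contain '/profile/', as in the question's example.
  if (!link.includes('/profile/')) return null;
  return link.replace('/ru/', '/ro/');
}

const sample = {
  link: 'https://999.md/ru/12345',                    // hypothetical ad URL
  accountLink: 'https://999.md/ru/profile/denisserj', // profile URL from the question
};
const profileUrl = accountUrlFor(sample);
```

Returning null for adverts without a profile link lets the caller skip them instead of requesting an ad page and parsing it with profile selectors.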
