
Scraping websites with Scrapy

I have the following spider:

import scrapy
import requests



class cvbankas(scrapy.Spider):
    name ='bankas'
    allowed_domains =['cvbankas.lt']
    start_urls = ['https://www.cvbankas.lt/']

    def parse(self,response):        
        job_position_tag = response.css("h3.list_h3::text").extract()
        city_tag = response.css("span.list_city::text").extract()
        company_tag = response.css("span.dib.mt5::text").extract()
        salary_tag = response.css("span.salary_amount::text").extract()



        for item in zip(job_position_tag,city_tag,company_tag,salary_tag):
            scraped_info={
                'company':company_tag,
                'city': city_tag,
                'position': job_position_tag,
                'salary': salary_tag,
            }

            yield scraped_info
        
        next_page = response.css('li > a::attr(href)').extract_first()
        if next_page:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(url = next_page, callback = self.parse)   

I don't know why it scrapes only 3 pages. The output (marked in red) covers only 3 of the 88 pages.

Where is the problem in the pagination?

Your selector was matching the first <a> tag it could find, which was the language-switcher <a> tag. You were changing languages, not pages.

import scrapy


class cvbankas(scrapy.Spider):
    name = 'bankas'
    allowed_domains = ['cvbankas.lt']
    start_urls = ['https://www.cvbankas.lt/']

    def parse(self, response):
        job_position_tag = response.css("h3.list_h3::text").extract()
        city_tag = response.css("span.list_city::text").extract()
        company_tag = response.css("span.dib.mt5::text").extract()
        salary_tag = response.css("span.salary_amount::text").extract()

        # Yield one item per listing, using the values unpacked from zip
        # rather than the whole lists. Note: zip stops at the shortest list,
        # so listings that lack a field (e.g. salary) can misalign.
        for position, city, company, salary in zip(
                job_position_tag, city_tag, company_tag, salary_tag):
            yield {
                'company': company,
                'city': city,
                'position': position,
                'salary': salary,
            }

        # Select the pagination links by class and follow the last one.
        # Checking the list first avoids an IndexError when none are found.
        pagination_links = response.xpath('//a[@class="prev_next"]/@href').extract()
        if pagination_links:
            next_page = response.urljoin(pagination_links[-1])
            yield scrapy.Request(url=next_page, callback=self.parse)

It looks like the website you are scraping uses the URL format uri?page=x, so a simple loop that substitutes x can also solve your problem.
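A minimal sketch of that loop, assuming the listing pages really are reachable as ?page=1 through ?page=88 (the page count comes from the question and will drift over time). The helper name page_urls and the constants BASE_URL and LAST_PAGE are made up for illustration:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.cvbankas.lt/"  # taken from the question's start_urls
LAST_PAGE = 88  # assumption: page count mentioned in the question


def page_urls(base_url, last_page, param="page"):
    """Build one URL per results page: base?page=1 ... base?page=last_page."""
    return [
        f"{base_url}?{urlencode({param: number})}"
        for number in range(1, last_page + 1)
    ]


urls = page_urls(BASE_URL, LAST_PAGE)
print(urls[0])
print(urls[-1])
```

In the spider you could then set start_urls = page_urls(BASE_URL, LAST_PAGE) and drop the next-page request from parse. The trade-off is the hard-coded page count: following the "next" link adapts when the number of pages changes, while this loop does not.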

