Scraping websites with Scrapy

Question

I have the following spider:

import scrapy
import requests



class cvbankas(scrapy.Spider):
    name ='bankas'
    allowed_domains =['cvbankas.lt']
    start_urls = ['https://www.cvbankas.lt/']

    def parse(self,response):        
        job_position_tag = response.css("h3.list_h3::text").extract()
        city_tag = response.css("span.list_city::text").extract()
        company_tag = response.css("span.dib.mt5::text").extract()
        salary_tag = response.css("span.salary_amount::text").extract()



        for item in zip(job_position_tag,city_tag,company_tag,salary_tag):
            scraped_info={
                'company':company_tag,
                'city': city_tag,
                'position': job_position_tag,
                'salary': salary_tag,
            }

            yield scraped_info
        
        next_page = response.css('li > a::attr(href)').extract_first()
        if next_page:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(url = next_page, callback = self.parse)

I don't know why it scrapes only 3 pages. The output covers only 3 of the 88 pages.

What is wrong with the pagination?

Answer 1

Your selector matched the first <a> tag it could find, which was the language-switcher <a> tag. You were changing languages, not pages.

import scrapy



class cvbankas(scrapy.Spider):
    name ='bankas'
    allowed_domains =['cvbankas.lt']
    start_urls = ['https://www.cvbankas.lt/']

    def parse(self,response):        
        job_position_tag = response.css("h3.list_h3::text").extract()
        city_tag = response.css("span.list_city::text").extract()
        company_tag = response.css("span.dib.mt5::text").extract()
        salary_tag = response.css("span.salary_amount::text").extract()



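        # zip() pairs the four lists element by element, so yield the values
        # for one job ad per item rather than the full lists each time.
        for position, city, company, salary in zip(job_position_tag, city_tag, company_tag, salary_tag):
            yield {
                'company': company,
                'city': city,
                'position': position,
                'salary': salary,
            }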
            
        
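        # Select the pagination links by class; the bare 'li > a' selector
        # matched the language switcher first. The last match is used as "next".
        next_links = response.xpath('//a[@class="prev_next"]/@href').extract()
        if next_links:
            # Guarding on the list avoids an IndexError when no link is found.
            next_page = response.urljoin(next_links[-1])
            yield scrapy.Request(url=next_page, callback=self.parse)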

Answer 2

It looks like the website you are scraping uses the URL format uri?page=x, so a simple loop that substitutes x can solve your problem.
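A minimal sketch of that loop, assuming the listing pages really do follow the ?page=x pattern and that there are 88 of them (the count mentioned in the question). The helper name build_page_urls is hypothetical and not part of Scrapy:

```python
def build_page_urls(base_url, last_page):
    """Return one listing URL per page, using the ?page=x format.

    Assumes pages are numbered 1..last_page with no gaps.
    """
    return [f"{base_url}?page={page}" for page in range(1, last_page + 1)]


# Assumption: 88 pages in total, as stated in the question.
start_urls = build_page_urls("https://www.cvbankas.lt/", 88)
```

The resulting list could be assigned to the spider's start_urls, in which case parse() would no longer need to follow a "next" link. The page count is hard-coded here, so it would have to be updated if the site grows.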
