
My Scrapy spider can't extract data from the next page

I have been asked to scrape all the job details from a website. My spider succeeds in getting the link to the next page, but it only extracts the data from the first one.
This is my spider:

name = 'jobs'
allowed_domains = ['www.tanitjobs.com/jobs']
start_urls = ['https://www.tanitjobs.com/jobs']

def parse(self, response):
    all_jobs = response.css(".listing-item__jobs")

    for job in all_jobs:
        item = {
            'jobname' : job.css("article.listing-item div.listing-item__title a::text").getall(),
            "companyname" : job.css(".listing-item__info--item-company::text").extract(),
            "city" : job.css(".listing-item__info--item-location::text").extract() ,
            }

        yield item

    next_page = response.css(".pad_right_small a ::attr(href)").extract_first()
    if next_page:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(url=next_page, callback=self.parse)

This is the result I got after running the spider:

If anyone knows what the problem might be, I would really appreciate your help. Thanks in advance.

allowed_domains = ['www.tanitjobs.com/jobs']

As the variable name gives away, that list should only contain domains. What you have there is a partial URL, which causes the offsite filter to reject the request for the next page.

Unless you have a specific need otherwise, I would suggest only listing the base domain in that value:

allowed_domains = ['tanitjobs.com']
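
For completeness, here is a minimal sketch of what the corrected spider would look like, assuming the rest of the class stays as you posted it (the class name JobsSpider is just a placeholder, and response.follow is used as a shorthand for building the next-page request):

import scrapy


class JobsSpider(scrapy.Spider):
    name = 'jobs'
    # Only the registrable domain goes here; the offsite middleware
    # compares request hostnames against these values.
    allowed_domains = ['tanitjobs.com']
    start_urls = ['https://www.tanitjobs.com/jobs']

    def parse(self, response):
        for job in response.css(".listing-item__jobs"):
            yield {
                'jobname': job.css("article.listing-item div.listing-item__title a::text").getall(),
                'companyname': job.css(".listing-item__info--item-company::text").getall(),
                'city': job.css(".listing-item__info--item-location::text").getall(),
            }

        # With a valid allowed_domains entry, this request is no longer
        # filtered and the next page gets crawled.
        next_page = response.css(".pad_right_small a ::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)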
