
Scrapy callback function doesn't work “sometimes”

I'm trying to scrape a job-finding website. The process is like this:

  1. Request the first job-list page ( def start_requests ).
  2. Parse the job-list page via a callback to the parse_list function.
  3. For each job URL in the job-list page, log [ JobPageRequest ] {url}, then request it with a callback to the parse_detail function. The log looks something like this:

2018-04-21 13:49:54,211: - [ JobPageRequest ] https://www.jobant.com/job-3998

  4. The parse_detail function logs that it has been called successfully, then starts parsing the details. The log looks something like this:

2018-04-21 13:52:57,494:jobant - [ JobPageParsing ] https://www.jobant.com/job-3998

  5. Find the next-page link on the current job-list page; if it exists, go to step 2, otherwise the job ends.

The problem is that the callback sometimes doesn't fire.
The website lists 64 jobs, but I get only 49 of them, so I looked at my log.
[ JobPageRequest ] has been logged exactly 64 times, the same as the number of jobs on the website, but [ JobPageParsing ] has been logged only 49 times.

I have tried this several times, and the result is exactly the same: 49 pages out of 64. The URLs that are not called are the same each time, yet I can't see any pattern or difference that separates them from the pages that are parsed successfully.

So it seems to me that those specific pages are not being called for some reason.
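A minimal diagnostic sketch like the following (the spider name and the error handler are hypothetical, not part of my code) attaches an errback to the detail requests so that plain download failures would at least be logged; note that requests dropped because the spider is being closed may never reach the errback at all:

import scrapy
try:
    from urlparse import urljoin      # Python 2, as in the question
except ImportError:
    from urllib.parse import urljoin  # Python 3

class ErrbackSketchSpider(scrapy.Spider):
    """Hypothetical minimal spider; the selectors mirror the question but are illustrative."""
    name = "jobant_errback_sketch"
    start_urls = ["https://www.jobant.com/jobs-search.php?page=1"]

    def parse(self, response):
        for href in response.xpath('//div[@class="item"]/div/div/div/a/@href').extract():
            url = urljoin("https://www.jobant.com/", href)
            yield scrapy.Request(url, callback=self.parse_detail,
                                 errback=self.on_detail_error)

    def parse_detail(self, response):
        self.logger.info('[ JobPageParsing ] %s' % response.url)

    def on_detail_error(self, failure):
        # failure.request is the Request that failed, failure.value is the exception
        self.logger.error('[ JobPageRequestFailed ] %s %r'
                          % (failure.request.url, failure.value))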

Here are the relevant parts of the code.

the start_requests

def start_requests(self):
    '''start first request on a job-list page'''
    url = "https://www.jobant.com/jobs-search.php?s_jobtype={job_type}&s_province={province}&page={page}"
    job_type  = self.job_type if hasattr(self,'job_type') else ''
    province = self.province if hasattr(self,'province') else ''
    formatted_url = url.format(page=self.page, job_type=job_type, province=province)

    self.logger.info('[ JobListRequest ] {url}'.format(url=formatted_url.encode('utf-8')))

    # callback to parse_list
    yield scrapy.Request(url=formatted_url.encode('utf-8'), callback=self.parse_list)

the parse_list

def parse_list(self, response):

    if self.killed:
        raise CloseSpider("Spider already died.")

    ### getting job urls from job list page.
    jobs = response.xpath('//div[@class="item"]/div/div/div/a/@href').extract()

    ### for each job page, request for html
    for job_id in jobs:
        url = urljoin("https://www.jobant.com/",job_id) 
        # the use_proxy is hard-coded as False atm
        if self.use_proxy:
            proxy = choice(self.proxies)
            self.logger.info('[ JobPageRequest ] {url} with proxy {proxy}'.format(url=url.encode('utf-8'), proxy=proxy))
            yield scrapy.Request(url, callback=self.parse_detail , meta={'proxy': proxy})
        else:
            self.logger.info('[ JobPageRequest ] {url}'.format(url=url.encode('utf-8')))
            # callback to parse_detail
            yield scrapy.Request(url, callback=self.parse_detail)

    # the rest is about finding next job-list page

The parse_detail part is not really important; the only relevant bit is that logging is the first thing done inside the function.

def parse_detail(self, response):

    self.logger.info('[ JobPageParsing ] {url}'.format(url=response.url.encode('utf-8')))

    ## .. The rest is not relevant

Here is my full code in case the error is somewhere else.

import scrapy
from datetime import datetime
from scrapy.utils.markup import remove_tags
from scrapy.http import FormRequest
from urlparse import urljoin
from scrapy.exceptions import CloseSpider
from random import choice
from hasher import hash_dn
from sqlalchemy import exc

class TDRISpider(scrapy.Spider):
    custom_settings = {
        'HTTPPROXY_ENABLED': True 
    }
    name        = "jobant"
    page        = 1
    web_id      = 1

    ## some variables set up by a factory script on run.
    logger      = None
    sqllogger   = None
    html_path   = None
    max_page    = 9999
    use_proxy   = False
    proxies     = []

    ## variables to track repeat / error
    repeat_count     = 0
    repeat_threshold = 3

    error_count      = 0
    error_threshold  = 5

    killed      = 0

    def start_requests(self):
        '''start first request on a job-list page'''
        url = "https://www.jobant.com/jobs-search.php?s_jobtype={job_type}&s_province={province}&page={page}"
        job_type  = self.job_type if hasattr(self,'job_type') else ''
        province = self.province if hasattr(self,'province') else ''
        formatted_url = url.format(page=self.page, job_type=job_type, province=province)

        self.logger.info('[ JobListRequest ] {url}'.format(url=formatted_url.encode('utf-8')))

        yield scrapy.Request(url=formatted_url.encode('utf-8'), callback=self.parse_list)

    def clean_tag(self,s):
        return ' '.join([x.strip() for x in remove_tags(s).split()])

    def parse_list(self, response):

        if self.killed:
            raise CloseSpider("Spider already died.")

        ### getting job urls from job list page.
        jobs = response.xpath('//div[@class="item"]/div/div/div/a/@href').extract()

        ### for each job page, request for html
        for job_id in jobs:
            url = urljoin("https://www.jobant.com/",job_id) 
            if self.use_proxy:
                proxy = choice(self.proxies)
                self.logger.info('[ JobPageRequest ] {url} with proxy {proxy}'.format(url=url.encode('utf-8'), proxy=proxy))
                yield scrapy.Request(url, callback=self.parse_detail , meta={'proxy': proxy})
            else:
                self.logger.info('[ JobPageRequest ] {url}'.format(url=url.encode('utf-8')))
                yield scrapy.Request(url, callback=self.parse_detail)

        ### getting next job list page url
        next_url = response.xpath('//ul[@class="pagination"]//a/@href').extract()
        if len(next_url) == 2:
            next_url = next_url[-1]
        elif len(next_url) == 1 and self.page <2:
            next_url = next_url[0]
        else:
            next_url = None

        ### request next job list, if it exists
        if next_url and self.page <= self.max_page:
            next_page = urljoin("https://www.jobant.com/",next_url)
            self.page += 1
            self.logger.info('[ JobListRequest ] {url}'.format(url=next_page.encode('utf-8')))
            yield scrapy.Request(url=next_page.encode('utf-8'), callback=self.parse_list)
        elif next_url:
            self.logger.info('[ JobEndReached ] Max page reached at # %d' % self.max_page)
            raise CloseSpider("Max page reached")
        else:
            self.logger.info('[ JobEndReached ] Last page reached at # %d' % self.page)
            raise CloseSpider("Last page reached")

    def parse_detail(self, response):

        self.logger.info('[ JobPageParsing ] {url}'.format(url=response.url.encode('utf-8')))

        if self.killed:
            raise CloseSpider("Spider already died.")

        ### handle the case when response from web server is empty
        # retry requesting, after 5 failures in a row, log error then continue.
        if not response.body:
            self.error_count += 1

            if self.error_count >= self.error_threshold:
                self.logger.error('[ JobPageRequestException ] {url}'.format(url=response.url.encode('utf-8')))
                self.sqllogger.log_error_page(
                    hash_code    = hash_dn(response.url.encode('utf-8'),datetime.now().strftime('%Y%m%d%H%M%S')),
                    web_id       = self.web_id,
                    url          = response.url.encode('utf-8'),
                    meta         = response.meta,
                    html_path    = html_path,
                    crawl_time   = datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
                    job_status   = 'FAILED',
                    error_message= "Empty request's response"
                )
                yield None
                return
            if self.use_proxy:
                proxy = choice(self.proxies)
                self.logger.info('[ JobPageRetry ] {url} with proxy {proxy}'.format(url=response.url.encode('utf-8'), proxy=proxy))
                yield scrapy.Request(response.url.encode('utf-8'), callback=self.parse_detail , meta={'proxy': proxy})
                return
            else:
                self.logger.info('[ JobPageRetry ] {url}'.format(url=response.url.encode('utf-8')))
                yield scrapy.Request(response.url.encode('utf-8'), callback=self.parse_detail)
                return
        self.error_count     = 0
        ###

        ### writing html archive
        try:
            html_path = self.html_path.format(dttm=datetime.now().strftime('%Y%m%d_%H%M%S'))
            with open(html_path, 'w') as f:
                f.write(response.text.encode('utf-8'))
                self.logger.info('[ HTMLArchived ] {url}'.format(url=response.url.encode('utf-8')))
        except Exception as e:
            self.logger.error('[ HTMLArchiveException ] {url}'.format(url=response.url.encode('utf-8')))
        ###

        try:
            ### parsing information
            contents         = response.xpath('.//div[@class="wrapper-preview-list"]/div[contains(@class,"row tr")]/div[contains(@class,"col-sm")]')
            content_str      = [self.clean_tag(content.xpath('./div/div')[1].extract()) for content in contents[:10]]

            pos, company     = [x.strip() for x in response.xpath('//h1[@class="title-section c4 xs-mt5"]/text()').extract_first().split(',',1)]

            ret = {}

            ret['company']   = company
            ret['pos']       = pos
            ret['etype']     = content_str[1]
            ret['indus']     = content_str[2]
            ret['amnt']      = content_str[3]
            ret['sal']       = content_str[4]
            ret['exp']       = content_str[5]
            ret['sex']       = content_str[6]
            ret['edu']       = content_str[7]
            ret['loc']       = content_str[8]
            ret['desc']      = '|'.join([x.strip() for x in contents[11].xpath('./text()').extract()])
            ret['pdate']     = response.xpath('//span[@itemprop="datePosted"]/text()').extract_first()

            if ret['pdate'].split('/')[-1] == "2017":
                self.logger.info("[ JobEndReached ] 2017 reached")
                self.killed  = 1
                raise CloseSpider("2017 reached")

            for key in ret.keys():
                if ret[key]:
                    ret[key] = ret[key].strip().replace('%','%%').encode('utf-8')
            ###

            # create hash for tracking jobs
            _hash = hash_dn(ret['desc'],ret['company']) 

            ### log result to MySQL
            try:
                self.sqllogger.log_crawled_page(
                    hash_code    = _hash,
                    position     = ret['pos'],
                    employer     = ret['company'],
                    exp          = ret['exp'],
                    salary       = ret['sal'],
                    location     = ret['loc'],
                    web_id       = self.web_id,
                    url          = response.url.encode('utf-8'),
                    meta         = response.meta,
                    html_path    = html_path,
                    crawl_time   = datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
                    post_time    = ret['pdate'],
                    job_status   = 'SUCCESS',
                    error_message= ''
                )
                self.logger.info('[ RDSLogged ] {url}'.format(url=response.url.encode('utf-8')))
            except exc.IntegrityError as e:
                ### check encountering old record by catching error that mysql will throw
                # if old record is met. (primary key(hash) is repeating)
                # The error code for such error is 1062
                ### Stop spider after encountering crawled record 3 times IN A ROW.
                # to prevent spider stopping just from getting a few old records
                # that may happen because of new job updates
                if e.orig.args[0] == 1062 and self.repeat_count >= self.repeat_threshold:
                    self.logger.info("[ JobEndReached ] crawled record reached exceeding threshold")
                    self.killed = 1
                    raise CloseSpider("Crawled record reached")
                elif e.orig.args[0] == 1062 and self.repeat_count < self.repeat_threshold:
                    self.repeat_count += 1
                    self.logger.info("[ JobRepeat ] crawled record found within threshold #%d" % self.repeat_count)
                    yield None
                    return
                else:
                    raise e
                ###
            self.repeat_count = 0
            ###

            yield ret

        except CloseSpider as e:
            raise CloseSpider(e.message)

        except Exception as e:
            self.logger.error('[ JobDetailException ] {url} {html_path} {e}'.format(url=response.url.encode('utf-8'),html_path=html_path.encode('utf-8'),e=e))
            self.sqllogger.log_error_page(
                hash_code    = hash_dn(response.url.encode('utf-8'),datetime.now().strftime('%Y%m%d%H%M%S')),
                web_id       = self.web_id,
                url          = response.url.encode('utf-8'),
                meta         = response.meta,
                html_path    = html_path,
                crawl_time   = datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
                job_status   = 'FAILED',
                error_message= e
            )

While I was typing this question, I recognized my error. It is rather silly, but it might be useful to someone else.

In the parse_list function, I have this part of the code that detects the last job-list page:

if next_url and self.page <= self.max_page:
    next_page = urljoin("https://www.jobant.com/",next_url)
    self.page += 1
    self.logger.info('[ JobListRequest ] {url}'.format(url=next_page.encode('utf-8')))
    yield scrapy.Request(url=next_page.encode('utf-8'), callback=self.parse_list)
elif next_url:
    self.logger.info('[ JobEndReached ] Max page reached at # %d' % self.max_page)
    raise CloseSpider("Max page reached")
else:
    self.logger.info('[ JobEndReached ] Last page reached at # %d' % self.page)
    raise CloseSpider("Last page reached")

Here is my mistake:

when I manually raise the CloseSpider exception to stop crawling, it also drops requests that have already been scheduled but have not started yet.

It was not obvious to me because I had experimented and found that raising CloseSpider doesn't kill the spider immediately, so I wrongly assumed that any request issued before CloseSpider would eventually be finished.
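The fix that follows from this (a sketch of the tail of parse_list, assuming nothing else needs to shut the spider down early) is simply not to raise CloseSpider when the last list page is reached, and instead just stop yielding the next job-list request; the detail requests that are already queued then finish on their own, and the spider closes once the queue drains:

### request next job list, if it exists
if next_url and self.page <= self.max_page:
    next_page = urljoin("https://www.jobant.com/", next_url)
    self.page += 1
    self.logger.info('[ JobListRequest ] {url}'.format(url=next_page))
    yield scrapy.Request(url=next_page, callback=self.parse_list)
else:
    # Don't raise CloseSpider here: just stop scheduling new list pages.
    # The detail requests already in the scheduler still get downloaded,
    # and the spider closes by itself when there is nothing left to crawl.
    self.logger.info('[ JobEndReached ] stopped paginating at page # %d' % self.page)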
