
Scrapy spider crawling but not exporting

I have Scrapy code that works in the shell, but when I try to export the results to CSV, I get an empty file. The spider exports data when I do not follow each link to parse the description, but once I add the extra method that parses the contents, it stops producing output. Here is the code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request

from monster.items import MonsterItem  # adjust to your project's items module


class MonsterSpider(CrawlSpider):
    name = "monster"
    allowed_domains = ["jobs.monster.com"]
    base_url = "http://jobs.monster.com/v-technology.aspx?"
    start_urls = [
        "http://jobs.monster.com/v-technology.aspx"
    ]
    for i in range(1,5):
        start_urls.append(base_url + "page=" + str(i))

    rules = (Rule(SgmlLinkExtractor(allow=("jobs.monster.com",)),
                  callback='parse_items'),)

    def parse_items(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="col-xs-12"]')

        #items = []

        for site in sites.xpath('.//article[@class="js_result_row"]'):
            item = MonsterItem()
            item['title'] = site.xpath('.//span[@itemprop = "title"]/text()').extract()
            item['company'] = site.xpath('.//span[@itemprop = "name"]/text()').extract()
            item['city'] = site.xpath('.//span[@itemprop = "addressLocality"]/text()').extract()
            item['state'] = site.xpath('.//span[@itemprop = "addressRegion"]/text()').extract()
            item['link'] = site.xpath('.//a[@data-m_impr_a_placement_id= "jsr"]/@href').extract()
            follow = ''.join(item["link"])
            request = Request(follow, callback = self.parse_dir_contents)
            request.meta["item"] =  item
            yield request
            #items.append(item)
            #return items

    def parse_dir_contents(self, response):
        item = response.meta["item"]
        item['desc'] = site.xpath('.//div[@itemprop = "description"]/text()').extract()
        return item

The original code had no parse_dir_contents method; instead, the commented-out items list and the append/return lines above were active, and that version exported data fine.

Well, as @tayfun suggests, you should either use response.xpath or define the site variable.

By the way, you do not need sel = Selector(response). Response objects come with an xpath method of their own, so there is no need to wrap the response in another selector.
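For example, the top of parse_items can be trimmed to query the response directly (same XPath expressions as in the question; only the first field is shown):

def parse_items(self, response):
    # a Response can be queried directly; no Selector(response) needed
    sites = response.xpath('//div[@class="col-xs-12"]')
    for site in sites.xpath('.//article[@class="js_result_row"]'):
        item = MonsterItem()
        item['title'] = site.xpath('.//span[@itemprop="title"]/text()').extract()
        # ... the remaining fields and the Request follow as before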

However, the main problem is that you restrict the domain of the spider. You define allowed_domains = ["jobs.monster.com"], but if you look at the URLs your custom Request follows, you can see that they live on hosts like http://jobview.monster.com/ or http://job-openings.monster.com . Because those hosts are outside the allowed domain, the offsite middleware filters the requests, parse_dir_contents is never executed, and your item never gets returned, so you end up with no results.

Change allowed_domains = ["jobs.monster.com"] to

allowed_domains = ["monster.com"]

and your spider will follow the job pages and return items.
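If you want to check the domain matching yourself, Scrapy ships a small helper, scrapy.utils.url.url_is_from_any_domain, which performs the same kind of domain suffix matching as the offsite filter; the job URL below is made up for illustration:

from scrapy.utils.url import url_is_from_any_domain

url = "http://jobview.monster.com/some-job.aspx"  # made-up job detail URL

print(url_is_from_any_domain(url, ["jobs.monster.com"]))  # False -> request filtered
print(url_is_from_any_domain(url, ["monster.com"]))       # True  -> request followed

Once the follow-up requests pass the filter, running scrapy crawl monster -o output.csv should produce a non-empty file again.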

You have an error in your parse_dir_contents method:

def parse_dir_contents(self, response):
    item = response.meta["item"]
    item['desc'] = response.xpath('.//div[@itemprop="description"]/text()').extract()
    return item

Note the use of response. I don't know where the site variable you are currently using is supposed to come from; it is not defined anywhere in that method.
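Since your code already works in the shell, you can verify the corrected expression the same way: open scrapy shell on any job detail page (the URL below is a placeholder) and run the XPath against the response object it gives you:

# inside: scrapy shell "http://jobview.monster.com/some-job.aspx"  (placeholder URL)
response.xpath('.//div[@itemprop="description"]/text()').extract()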

Also, try to provide the error details when you post a question. Writing "it fails to work" doesn't say much.
