
ScraPy spider crawling but not exporting

I have a ScraPy spider that runs in the shell, but when I try to export the scraped data to CSV, it produces an empty file. It exports data when I do not follow each link to parse the description, but once I add the extra method that parses the page contents, it stops working. Here is the code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request

from monster.items import MonsterItem  # item module path assumed


class MonsterSpider(CrawlSpider):
    name = "monster"
    allowed_domains = ["jobs.monster.com"]
    base_url = "http://jobs.monster.com/v-technology.aspx?"
    start_urls = [
        "http://jobs.monster.com/v-technology.aspx"
    ]
    for i in range(1,5):
        start_urls.append(base_url + "page=" + str(i))

    rules = (Rule(SgmlLinkExtractor(allow=("jobs.monster.com",)),
                  callback='parse_items'),)

    def parse_items(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="col-xs-12"]')

        #items = []

        for site in sites.xpath('.//article[@class="js_result_row"]'):
            item = MonsterItem()
            item['title'] = site.xpath('.//span[@itemprop = "title"]/text()').extract()
            item['company'] = site.xpath('.//span[@itemprop = "name"]/text()').extract()
            item['city'] = site.xpath('.//span[@itemprop = "addressLocality"]/text()').extract()
            item['state'] = site.xpath('.//span[@itemprop = "addressRegion"]/text()').extract()
            item['link'] = site.xpath('.//a[@data-m_impr_a_placement_id= "jsr"]/@href').extract()
            follow = ''.join(item["link"])
            request = Request(follow, callback = self.parse_dir_contents)
            request.meta["item"] =  item
            yield request
            #items.append(item)
            #return items

    def parse_dir_contents(self, response):
        item = response.meta["item"]
        item['desc'] = site.xpath('.//div[@itemprop = "description"]/text()').extract()
        return item

The original code, which did export, had parse_dir_contents removed and the commented-out items list and the append/return lines enabled.

Well, as @tayfun suggests, you should use response.xpath or define the site variable.

By the way, you do not need sel = Selector(response). Responses come with an xpath method, so there is no need to wrap the response in another selector.

However, the main problem is that you restrict the domain of the spider. You define allowed_domains = ["jobs.monster.com"], but if you look at the URLs your custom Request follows, you can see that they are something like http://jobview.monster.com/ or http://job-openings.monster.com. In this case parse_dir_contents is not executed (the domain is not allowed) and your item does not get returned, so you won't get any results.

Change allowed_domains = ["jobs.monster.com"] to

allowed_domains = ["monster.com"]

and your app will work and return items.

You also have an error in your parse_dir_contents method:

def parse_dir_contents(self, response):
    item = response.meta["item"]
    item['desc'] = response.xpath('.//div[@itemprop="description"]/text()').extract()
    return item

Note the use of response. I don't know where the site variable you are currently using came from.

Also, try to provide the error details when you post a question. Writing "it fails to work" doesn't say much.


 