Scrapy spider crawling but not exporting
I have a Scrapy spider that runs in the shell, but when I try to export the scraped data to CSV, it returns an empty file. It exports data when I do not follow the link and try to parse the description, but once I add the extra method that parses the contents, it fails to work. Here is the code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request
from myproject.items import MonsterItem  # your project's item module (path assumed)

class MonsterSpider(CrawlSpider):
    name = "monster"
    allowed_domains = ["jobs.monster.com"]
    base_url = "http://jobs.monster.com/v-technology.aspx?"
    start_urls = [
        "http://jobs.monster.com/v-technology.aspx"
    ]
    for i in range(1, 5):
        start_urls.append(base_url + "page=" + str(i))

    rules = (Rule(SgmlLinkExtractor(allow=("jobs.monster.com",)),
                  callback='parse_items'),)

    def parse_items(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="col-xs-12"]')
        #items = []
        for site in sites.xpath('.//article[@class="js_result_row"]'):
            item = MonsterItem()
            item['title'] = site.xpath('.//span[@itemprop="title"]/text()').extract()
            item['company'] = site.xpath('.//span[@itemprop="name"]/text()').extract()
            item['city'] = site.xpath('.//span[@itemprop="addressLocality"]/text()').extract()
            item['state'] = site.xpath('.//span[@itemprop="addressRegion"]/text()').extract()
            item['link'] = site.xpath('.//a[@data-m_impr_a_placement_id="jsr"]/@href').extract()
            follow = ''.join(item["link"])
            request = Request(follow, callback=self.parse_dir_contents)
            request.meta["item"] = item
            yield request
            #items.append(item)
        #return items

    def parse_dir_contents(self, response):
        item = response.meta["item"]
        item['desc'] = site.xpath('.//div[@itemprop="description"]/text()').extract()
        return item
Taking out parse_dir_contents and uncommenting the empty items list and the append code gives the original version, which exported fine.
Well, as @tayfun suggests, you should use response.xpath or define the site variable.

By the way, you do not need to use sel = Selector(response). Responses come with the xpath function; there is no need to wrap them in another selector.
However, the main problem is that you restrict the domain of the spider. You define allowed_domains = ["jobs.monster.com"], but if you look at the URLs you follow with your custom Request, you can see that they look like http://jobview.monster.com/ or http://job-openings.monster.com. In this case your parse_dir_contents is not executed (the domain is not allowed) and your item does not get returned, so you won't get any results.
Change allowed_domains = ["jobs.monster.com"] to allowed_domains = ["monster.com"] and you will be fine: your spider will work and return items.
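This works because Scrapy's offsite filtering treats each entry in allowed_domains as a domain suffix, so any subdomain of an allowed domain passes. A minimal stand-alone sketch of that suffix check (a simplified illustration, not Scrapy's actual implementation):

```python
from urllib.parse import urlparse

def is_allowed(url, allowed_domains):
    # A host passes if it equals an allowed domain or is a subdomain of one.
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

# With the original setting, the detail pages are filtered out:
print(is_allowed("http://jobview.monster.com/some-job", ["jobs.monster.com"]))  # False
# Broadening to the parent domain lets them through:
print(is_allowed("http://jobview.monster.com/some-job", ["monster.com"]))       # True
```

Any request whose host fails this check is dropped before its callback ever runs, which is exactly why parse_dir_contents was never called.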
You have an error in your parse_dir_contents method:
def parse_dir_contents(self, response):
    item = response.meta["item"]
    item['desc'] = response.xpath('.//div[@itemprop="description"]/text()').extract()
    return item
Note the use of response. I don't know where the site variable you are currently using comes from; it is not defined in that method.
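The reason the item must come from response.meta here is that parse_dir_contents runs against a different response than the one the item was built from. The hand-off can be sketched without Scrapy using plain dicts (hypothetical stand-in functions, only to illustrate the data flow):

```python
# Hypothetical stand-ins for the two callbacks; dicts play the roles of
# MonsterItem, Request.meta, and Response.meta (no Scrapy required).
def parse_items(listing_row):
    item = {"title": listing_row["title"], "link": listing_row["link"]}
    # Scrapy equivalent: request = Request(follow, callback=self.parse_dir_contents)
    #                    request.meta["item"] = item; yield request
    return {"url": item["link"], "meta": {"item": item}}

def parse_dir_contents(response):
    item = response["meta"]["item"]   # recover the partially filled item
    item["desc"] = response["body"]   # complete it with the detail-page data
    return item

request = parse_items({"title": "Engineer", "link": "http://jobview.monster.com/x"})
response = {"meta": request["meta"], "body": "Job description text"}
item = parse_dir_contents(response)
print(item["desc"])  # Job description text
```

The second callback can only see what travels in meta (plus the new response), which is why reaching for an undefined site there fails.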
Also, try to provide the error details when you post a question. Writing "it fails to work" doesn't say much.