Meta is not working in Scrapy (Python)

I am working with the Scrapy framework. Below is my spider.py code:

class Example(BaseSpider):
    name = "example"
    allowed_domains = {"http://www.example.com"}

    start_urls = [
        "http://www.example.com/servlet/av/search&SiteName=page1"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        hrefs = hxs.select('//table[@class="knxa"]/tr/td/a/@href').extract()
        # hrefs holds every href value; copy each one into forwarding_hrefs as a UTF-8 string
        forwarding_hrefs = []
        for i in hrefs:
            forwarding_hrefs.append(i.encode('utf-8'))
        return Request('http://www.example.com/servlet/av/search&SiteName=page2',
                       meta={'forwarding_hrefs': response.meta['forwarding_hrefs']},
                       callback=self.parseJob)

    def parseJob(self, response):
        print response, ">>>>>>>>>>>"

Result:

2012-07-18 17:29:15+0530 [example] DEBUG: Crawled (200) <GET http://www.example.com/servlet/av/search&SiteName=page1> (referer: None)
2012-07-18 17:29:15+0530 [MemorialReqionalHospital] ERROR: Spider error processing <GET http://www.example.com/servlet/av/search&SiteName=page2>
    Traceback (most recent call last):
      File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1167, in mainLoop
        self.runUntilCurrent()
      File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 789, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 361, in callback
        self._startRunCallbacks(result)
      File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 455, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 542, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/home/local/user/project/example/example/spiders/example_spider.py", line 36, in parse
        meta={'forwarding_hrefs': response.meta['forwarding_hrefs']},
    exceptions.KeyError: 'forwarding_hrefs'

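What I am trying to do is collect all the href values from

http://www.example.com/servlet/av/search&SiteName=page1

put them into forwarding_hrefs, and pass forwarding_hrefs along with the next request (I want to use this list in the next method):

http://www.example.com/servlet/av/search&SiteName=page2

I also want to add the href values from page 2 to forwarding_hrefs, then loop over forwarding_hrefs and yield a request for each href. That is my plan, but it raises the error above. What is wrong with the code? As I understand it, meta is meant for copying items from one request to the next. Can anyone tell me how to copy the forwarding_hrefs list from the parse method to the parseJob method?

In short, my goal is to copy the forwarding_hrefs list from the parse method to the parseJob method.

I hope I have explained this well; if not, please let me know.

Thanks in advance.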

I have not run this, but the error appears to be here:

 return Request('http://www.example.com/servlet/av/search&SiteName=page2',
                meta={'forwarding_hrefs': response.meta['forwarding_hrefs']},
                callback=self.parseJob)    

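You are passing response.meta['forwarding_hrefs'], but that key does not exist on this response: the page1 request was generated from start_urls, so nothing ever put a forwarding_hrefs entry into its meta.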
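To see why the lookup fails, here is a minimal, dependency-free sketch of how meta travels. FakeRequest and FakeResponse are hypothetical stand-ins, not Scrapy classes, and the href values are made up. The only behaviour assumed from Scrapy is that response.meta is the meta dict of the request that produced the response (real Scrapy also adds a few keys of its own, which this model leaves out).

```python
class FakeRequest:
    """Hypothetical stand-in for a request that carries a meta dict."""

    def __init__(self, url, meta=None):
        self.url = url
        self.meta = dict(meta or {})


class FakeResponse:
    """Hypothetical stand-in for a response; its meta comes from its request."""

    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta


# The start_urls request is built without forwarding_hrefs in its meta,
# so the page1 response does not have that key either.
page1 = FakeResponse(
    FakeRequest("http://www.example.com/servlet/av/search&SiteName=page1")
)

try:
    page1.meta["forwarding_hrefs"]
    lookup_failed = False
except KeyError:
    lookup_failed = True  # the same KeyError as in the traceback

# Passing the local list instead makes it available on the next response.
forwarding_hrefs = ["/job/1", "/job/2"]  # illustrative values only
page2 = FakeResponse(
    FakeRequest(
        "http://www.example.com/servlet/av/search&SiteName=page2",
        meta={"forwarding_hrefs": forwarding_hrefs},
    )
)
received = page2.meta["forwarding_hrefs"]
```

In other words, meta only carries what the previous callback explicitly put into the request.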

You need to pass the local list instead:

 return Request('http://www.example.com/servlet/av/search&SiteName=page2',
                meta={'forwarding_hrefs': forwarding_hrefs},
                callback=self.parseJob)  

You already have the local forwarding_hrefs list, so this sends it to parseJob inside meta. In parseJob you can then read response.meta['forwarding_hrefs'], because the key will exist on that response object.
