Scrapy Crawler in python cannot follow links?
I wrote a crawler in Python using the Scrapy tool. Here is the Python code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
#from scrapy.item import Item
from a11ypi.items import AYpiItem

class AYpiSpider(CrawlSpider):
    name = "AYpi"
    allowed_domains = ["a11y.in"]
    start_urls = ["http://a11y.in/a11ypi/idea/firesafety.html"]

    rules =(
        Rule(SgmlLinkExtractor(allow = ()) ,callback = 'parse_item')
        )

    def parse_item(self,response):
        #filename = response.url.split("/")[-1]
        #open(filename,'wb').write(response.body)
        #testing codes ^ (the above)

        hxs = HtmlXPathSelector(response)
        item = AYpiItem()
        item["foruri"] = hxs.select("//@foruri").extract()
        item["thisurl"] = response.url
        item["thisid"] = hxs.select("//@foruri/../@id").extract()
        item["rec"] = hxs.select("//@foruri/../@rec").extract()
        return item
But instead of following the links, it raises this error:
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/cmdline.py", line 131, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/cmdline.py", line 97, in _run_print_help
    func(*a, **kw)
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/cmdline.py", line 138, in _run_command
    cmd.run(args, opts)
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/commands/crawl.py", line 45, in run
    q.append_spider_name(name, **opts.spargs)
--- <exception caught here> ---
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/queue.py", line 89, in append_spider_name
    spider = self._spiders.create(name, **spider_kwargs)
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/spidermanager.py", line 36, in create
    return self._spiders[spider_name](**spider_kwargs)
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/contrib/spiders/crawl.py", line 38, in __init__
    self._compile_rules()
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/contrib/spiders/crawl.py", line 82, in _compile_rules
    self._rules = [copy.copy(r) for r in self.rules]
exceptions.TypeError: 'Rule' object is not iterable
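Can someone explain what is going on? This is what the documentation describes, and since I left the allow field empty, that by itself should make it follow links by default. So why the error? Also, are there any optimizations I can make to speed up my crawler?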
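As far as I can see, your rules attribute is not an iterable. It looks like you were trying to make rules a tuple, but parentheses around a single expression without a trailing comma only group it, so rules ends up being a single Rule object. That is exactly what the TypeError is complaining about. You should read up on tuples in the Python documentation.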
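To see why the parentheses alone are not enough, here is a minimal sketch in plain Python. FakeRule is a hypothetical stand-in for Scrapy's Rule, used only so the snippet runs without Scrapy installed, and compile_rules is an illustrative helper that mirrors the _compile_rules line shown in the traceback.

```python
import copy


class FakeRule(object):
    """Hypothetical stand-in for scrapy's Rule, for illustration only."""


# Parentheses around a single expression only group it; no tuple is created.
not_a_tuple = (FakeRule())

# The trailing comma is what makes a one-element tuple.
one_tuple = (FakeRule(),)

print(type(not_a_tuple).__name__)  # FakeRule
print(type(one_tuple).__name__)    # tuple


def compile_rules(rules):
    # Mirrors the failing line in the traceback: iterate over rules, copy each.
    return [copy.copy(r) for r in rules]


try:
    compile_rules(not_a_tuple)
    error_message = None
except TypeError as exc:
    # Same kind of failure as in the question: the object is not iterable.
    error_message = str(exc)

print(error_message)

# With a real one-element tuple the same loop works.
compiled = compile_rules(one_tuple)
print(len(compiled))
```

The first call fails for the same reason the spider does: iterating over a single object instead of a sequence of objects.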
To fix your problem, change these lines:
    rules =(
        Rule(SgmlLinkExtractor(allow = ()) ,callback = 'parse_item')
        )
To:
    rules =(Rule(SgmlLinkExtractor(allow = ()) ,callback = 'parse_item'),)
Notice the comma at the end?
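If the trailing comma feels easy to forget, a list is one way to sidestep the pitfall, since square brackets build a sequence even with a single element. The failing line in the traceback only iterates over rules, so any iterable should satisfy it. The sketch below again uses a hypothetical FakeRule stand-in rather than Scrapy itself, so treat it as an illustration of the Python behaviour, not a tested Scrapy configuration.

```python
class FakeRule(object):
    """Hypothetical stand-in for scrapy's Rule, for illustration only."""

    def __init__(self, callback=None):
        self.callback = callback


# A one-element list needs no trailing comma to be a sequence.
rules = [FakeRule(callback="parse_item")]

# Iterating works the same way as with a one-element tuple.
callbacks = [r.callback for r in rules]
print(callbacks)
```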