Scrapy issue when logging in and scraping all links
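I am trying to get Scrapy to log in to a website, then go to a specific page on that site and scrape information from it. I have the following code: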
# Imports assumed from the identifiers used (legacy scrapy.contrib paths)
from scrapy import log
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import FormRequest, Request


class DemoSpider(InitSpider):
    name = "demo"
    allowed_domains = ['example.com']
    login_page = "https://www.example.com/"
    start_urls = ["https://www.example.com/secure/example"]
    rules = (Rule(SgmlLinkExtractor(allow=r'\w+'), callback='parse_item', follow=True),)

    # Initialization
    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    # Perform login with the username and password
    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
            formdata={'name': 'user', 'password': 'password'},
            callback=self.check_login_response)

    # Check the response after logging in, make sure it went well
    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        else:
            self.log('will initialize')
            self.initialized(response)

    def parse_item(self, response):
        self.log('got to the parse item page')
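Every time I run the spider, it logs in and reaches the initialization step. However, it never matches the rule. Is there a reason for this? I have looked at several sites about this, and many others, including the documentation. Why does it never go through start_urls after initializing and then scrape each page?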
You can't use rules in an InitSpider. Rules are only available in CrawlSpider.
From looking at other questions, it seems you need to return self.initialized with no arguments, i.e. return self.initialized()
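Why the return matters can be shown with a toy model. The sketch below is plain Python, not Scrapy's actual code: MiniInitSpider, BrokenSpider, FixedSpider and run are hypothetical names, and the "requests" are just strings. It assumes, as InitSpider appears to do, that the start requests are held back until initialized() is called and that initialized() hands them back to the caller.

```python
class MiniInitSpider:
    """Toy stand-in for InitSpider: holds back start requests until initialized()."""

    start_urls = ["https://www.example.com/secure/example"]

    def start_requests(self):
        # Stash the real start requests and hand out only the init request
        self._postinit_reqs = [f"GET {url}" for url in self.start_urls]
        return [self.init_request()]

    def init_request(self):
        return "GET login page"

    def initialized(self, response=None):
        # Release the stashed requests; the caller has to pass them on
        return self.__dict__.pop("_postinit_reqs")


class BrokenSpider(MiniInitSpider):
    def check_login_response(self, response):
        self.initialized(response)  # result is discarded, nothing gets scheduled


class FixedSpider(MiniInitSpider):
    def check_login_response(self, response):
        return self.initialized()  # released requests reach the engine


def run(spider):
    """Toy engine: schedules only what the callback returns."""
    spider.start_requests()
    return spider.check_login_response("login ok") or []


print(run(BrokenSpider()))  # []
print(run(FixedSpider()))   # ['GET https://www.example.com/secure/example']
```

In this model the argument passed to initialized makes no difference; what changes the outcome is whether the callback returns the result.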