Crawling issue when logging in and scraping all links

Question

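I am trying to get Scrapy to log in to a website, then go to a specific page on that site and scrape information from it. I have the following code: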

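# Legacy import paths, matching the Scrapy release this code was written for
from scrapy import log
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import FormRequest, Request

class DemoSpider(InitSpider):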
    name = "demo"
    allowed_domains = ['example.com']
    login_page = "https://www.example.com/"
    start_urls = ["https://www.example.com/secure/example"]

    rules = (Rule(SgmlLinkExtractor(allow=r'\w+'),callback='parse_item', follow=True),)

    # Initialization
    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    # Perform login with the username and password
    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                    formdata={'name': 'user', 'password': 'password'},
                    callback=self.check_login_response)

    # Check the response after logging in, make sure it went well
    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return

        else:
            self.log('will initialize')
            self.initialized(response)

    def parse_item(self, response):
        self.log('got to the parse item page')

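Every time I run the spider, it logs in and reaches the initialization step. However, it never matches the rule. Is there a reason for this? I have checked the following question on the subject:

Crawling with an authenticated session in Scrapy

and many other sites, including the documentation. Why, after initialization, does it never go through start_urls and then scrape each page?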

Answer 1

You cannot use rules in InitSpider. They are only available in CrawlSpider.

Answer 2

Judging from other questions, it looks like you need to return self.initialized without arguments, i.e. return self.initialized()
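Why the `return` matters can be shown with a toy model. This is not Scrapy's code. It is a plain-Python sketch that assumes InitSpider holds back the start requests until `initialized()` is called and then hands them back as that method's return value. The names `ToyInitSpider`, `ForgetsReturn`, `ReturnsRequests` and `crawl` are all hypothetical.

```python
class ToyInitSpider:
    """Toy stand-in for an InitSpider-like class (not Scrapy's actual code)."""

    start_urls = ["https://www.example.com/secure/example"]

    def start_requests(self):
        # Hold back the normal start requests until initialization is done.
        self._postinit_reqs = list(self.start_urls)
        return ["LOGIN"]

    def initialized(self, response=None):
        # Hand the held-back requests to the caller.
        return self.__dict__.pop("_postinit_reqs")


class ForgetsReturn(ToyInitSpider):
    def check_login_response(self, response):
        # As in the question: the result is discarded, so None is returned.
        self.initialized(response)


class ReturnsRequests(ToyInitSpider):
    def check_login_response(self, response):
        # As suggested in the answer: pass the requests back to the engine.
        return self.initialized()


def crawl(spider):
    """Toy engine: schedule only what the spider's methods return."""
    scheduled = list(spider.start_requests())
    output = spider.check_login_response("login page response")
    scheduled.extend(output or [])
    return scheduled
```

In this model `crawl(ForgetsReturn())` schedules only the login step, while `crawl(ReturnsRequests())` also schedules the URL from start_urls. That matches the symptom in the question, where the spider logs in and then stops.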

2 answers

Answer 1
3, accepted, 2012-12-21 06:22:52

Answer 2
1, 2012-12-21 10:07:25
