Scrapy CrawlSpider isn't following the links on a particular page

I have made a spider to crawl a forum that requires a login. I start it off on the login page. The problem occurs with the page that I direct the spider to after the login was successful.

If I open up my rules to accept all links, the spider successfully follows the links on the login page. However, it doesn't follow any of the links on the page that I feed it using Request(). This suggests that the problem isn't with my XPaths.
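
For context, "opening up the rules" means swapping in a catch-all rule roughly like the sketch below (same Scrapy 0.18 SgmlLinkExtractor API as the spider further down; an extractor with no arguments matches every link):

    # Debug-only rule set: extract and follow every link on each page.
    rules = (
        Rule(SgmlLinkExtractor(), follow=True),
    )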

The login appears to work - the page_parse function writes the page source to a text file, and the source is from the page I'm looking for, which can only be reached after logging in. However, the pipeline I have in place to take a screenshot of each page captures the login page, but not this page that I then send it on to.

Here is the spider:

import logging

from scrapy import log
from scrapy.log import ScrapyFileLogObserver
from scrapy.http import Request, FormRequest
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from plm.items import PLMItem  # project item, assumed to live in plm/items.py


class PLMSpider(CrawlSpider):
    name = 'plm'
    allowed_domains = ["patientslikeme.com"]
    start_urls = [
        "https://www.patientslikeme.com/login"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r"patientslikeme.com/login")), callback='login_parse', follow=True),
        Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@class='content-section']")), callback='post_parse', follow=False),
        Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@class='pagination']")), callback='page_parse', follow=True),
    )

    def __init__(self, **kwargs):
        ScrapyFileLogObserver(open("debug.log", 'w'), level=logging.DEBUG).start()
        CrawlSpider.__init__(self, **kwargs)

    def post_parse(self, response):
        url = response.url
        log.msg("Post parse attempted for {0}".format(url))
        item = PLMItem()
        item['url'] = url
        return item

    def page_parse(self, response):
        url = response.url
        log.msg("Page parse attempted for {0}".format(url))
        item = PLMItem()
        item['url'] = url
        f = open("body.txt", "w")
        f.write(response.body)
        f.close()
        return item

    def login_parse(self, response):
        log.msg("Login attempted")
        return [FormRequest.from_response(response,
                    formdata={'userlogin[login]': username, 'userlogin[password]': password},
                    callback=self.after_login)]

    def after_login(self, response):
        log.msg("Post login")
        if "Login unsuccessful" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        else:
            return Request(url="https://www.patientslikeme.com/forum/diabetes2/topics",
               callback=self.page_parse)

And here is my debug log:

2014-03-21 18:22:05+0000 [scrapy] INFO: Scrapy 0.18.2 started (bot: plm)
2014-03-21 18:22:05+0000 [scrapy] DEBUG: Optional features available: ssl, http11
2014-03-21 18:22:05+0000 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'plm.spiders', 'ITEM_PIPELINES': {'plm.pipelines.ScreenshotPipeline': 1}, 'DEPTH_LIMIT': 5, 'SPIDER_MODULES': ['plm.spiders'], 'BOT_NAME': 'plm', 'DEPTH_PRIORITY': 1, 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeue.FifoMemoryQueue', 'SCHEDULER_DISK_QUEUE': 'scrapy.squeue.PickleFifoDiskQueue'}
2014-03-21 18:22:05+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-03-21 18:22:06+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-03-21 18:22:06+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-03-21 18:22:06+0000 [scrapy] DEBUG: Enabled item pipelines: ScreenshotPipeline
2014-03-21 18:22:06+0000 [plm] INFO: Spider opened
2014-03-21 18:22:06+0000 [plm] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-03-21 18:22:07+0000 [scrapy] INFO: Screenshooter initiated
2014-03-21 18:22:07+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-03-21 18:22:07+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-03-21 18:22:08+0000 [plm] DEBUG: Crawled (200) <GET https://www.patientslikeme.com/login> (referer: None)
2014-03-21 18:22:08+0000 [plm] DEBUG: Crawled (200) <GET https://www.patientslikeme.com/login> (referer: https://www.patientslikeme.com/login)
2014-03-21 18:22:08+0000 [scrapy] INFO: Login attempted
2014-03-21 18:22:08+0000 [plm] DEBUG: Filtered duplicate request: <GET https://www.patientslikeme.com/login> - no more duplicates will be shown (see DUPEFILTER_CLASS)
2014-03-21 18:22:09+0000 [plm] DEBUG: Redirecting (302) to <GET https://www.patientslikeme.com/profile/activity/all> from <POST https://www.patientslikeme.com/login>
2014-03-21 18:22:10+0000 [plm] DEBUG: Crawled (200) <GET https://www.patientslikeme.com/profile/activity/all> (referer: https://www.patientslikeme.com/login)
2014-03-21 18:22:10+0000 [scrapy] INFO: Post login
2014-03-21 18:22:10+0000 [plm] DEBUG: Crawled (200) <GET https://www.patientslikeme.com/forum/diabetes2/topics> (referer: https://www.patientslikeme.com/profile/activity/all)
2014-03-21 18:22:10+0000 [scrapy] INFO: Page parse attempted for https://www.patientslikeme.com/forum/diabetes2/topics
2014-03-21 18:22:10+0000 [scrapy] INFO: Screenshot attempted for https://www.patientslikeme.com/forum/diabetes2/topics
2014-03-21 18:22:15+0000 [plm] DEBUG: Scraped from <200 https://www.patientslikeme.com/forum/diabetes2/topics>

    {'url': 'https://www.patientslikeme.com/forum/diabetes2/topics'}
2014-03-21 18:22:15+0000 [plm] INFO: Closing spider (finished)
2014-03-21 18:22:15+0000 [plm] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 2068,
     'downloader/request_count': 5,
     'downloader/request_method_count/GET': 4,
     'downloader/request_method_count/POST': 1,
     'downloader/response_bytes': 53246,
     'downloader/response_count': 5,
     'downloader/response_status_count/200': 4,
     'downloader/response_status_count/302': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 3, 21, 18, 22, 15, 177000),
     'item_scraped_count': 1,
     'log_count/DEBUG': 13,
     'log_count/INFO': 8,
     'request_depth_max': 3,
     'response_received_count': 4,
     'scheduler/dequeued': 5,
     'scheduler/dequeued/memory': 5,
     'scheduler/enqueued': 5,
     'scheduler/enqueued/memory': 5,
     'start_time': datetime.datetime(2014, 3, 21, 18, 22, 6, 377000)}
2014-03-21 18:22:15+0000 [plm] INFO: Spider closed (finished)

Thanks for any help you can give.

---- EDIT ----

I have tried to implement Paul t.'s suggestion. Unfortunately, I'm getting the following error:

    Traceback (most recent call last):
      File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 93, in start
        if self.start_crawling():
      File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 168, in start_crawling
        return self.start_crawler() is not None
      File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 158, in start_crawler
        crawler.start()
      File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1213, in unwindGenerator
        return _inlineCallbacks(None, gen, Deferred())
    --- <exception caught here> ---
      File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1070, in _inlineCallbacks
        result = g.send(result)
      File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 74, in start
        yield self.schedule(spider, batches)
      File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 61, in schedule
        requests.extend(batch)
    exceptions.TypeError: 'Request' object is not iterable

Since the traceback doesn't point to a particular part of my spider, I'm struggling to work out where the problem is.

---- EDIT 2 ----

The problem was caused by the start_requests function provided by Paul t., which used return rather than yield. If I change it to yield, it works perfectly.
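
For anyone hitting the same traceback, the difference is only in start_requests (a sketch; the broken form is what triggers the error in scrapy/crawler.py shown above):

    # Broken: returning a bare Request means the scheduler's
    # requests.extend(batch) call tries to iterate over a Request object,
    # raising "exceptions.TypeError: 'Request' object is not iterable".
    #def start_requests(self):
    #    return Request(self.login_url, callback=self.login_parse)

    # Working: yield the request (or return a list) so Scrapy receives
    # an iterable of requests.
    def start_requests(self):
        yield Request(self.login_url, callback=self.login_parse)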

---- ANSWER (by Paul t.) ----

CrawlSpider only applies its rules to responses that come back through its built-in parse() callback, so a Request sent straight to a callback like page_parse never goes through the link extractors. My suggestion is to trick CrawlSpider with:

  • a manually crafted request to the login page,
  • performing the login,
  • and only then doing as if CrawlSpider were starting from start_urls, using CrawlSpider's "magic"

Here's an illustration of that:

class PLMSpider(CrawlSpider):
    name = 'plm'
    allowed_domains = ["patientslikeme.com"]

    # pseudo-start_url
    login_url = "https://www.patientslikeme.com/login"

    # start URLs used after login
    start_urls = [
        "https://www.patientslikeme.com/forum/diabetes2/topics",
    ]

    rules = (
        # you want to do the login only once, so it's probably cleaner
        # not to ask the CrawlSpider to follow links to the login page
        #Rule(SgmlLinkExtractor(allow=(r"patientslikeme.com/login")), callback='login_parse', follow=True),

        # you can also deny "/login" to be safe
        Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@class='content-section']"),
                               deny=('/login',)),
             callback='post_parse', follow=False),

        Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@class='pagination']"),
                               deny=('/login',)),
             callback='page_parse', follow=True),
    )

    def __init__(self, **kwargs):
        ScrapyFileLogObserver(open("debug.log", 'w'), level=logging.DEBUG).start()
        CrawlSpider.__init__(self, **kwargs)

    # by default start_urls pages will be sent to the parse method,
    # but parse() is rather special in CrawlSpider
    # so I suggest you create your own initial login request "manually"
    # and ask for it to be parsed by your specific callback
    def start_requests(self):
        yield Request(self.login_url, callback=self.login_parse)

    # you've got the login page, send credentials
    # (so far so good...)
    def login_parse(self, response):
        log.msg("Login attempted")
        return [FormRequest.from_response(response,
                    formdata={'userlogin[login]': username, 'userlogin[password]': password},
                    callback=self.after_login)]

    # so we got a response to the login thing
    # if we're good,
    # just do as if we were starting the crawl now,
    # basically doing what happens when you use start_urls
    def after_login(self, response):
        log.msg("Post login")
        if "Login unsuccessful" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        else:
            return [Request(url=u) for u in self.start_urls]
            # alternatively, you could even call CrawlSpider's start_requests() method directly
            # that's probably cleaner
            #return super(PLMSpider, self).start_requests()

    def post_parse(self, response):
        url = response.url
        log.msg("Post parse attempted for {0}".format(url))
        item = PLMItem()
        item['url'] = url
        return item

    def page_parse(self, response):
        url = response.url
        log.msg("Page parse attempted for {0}".format(url))
        item = PLMItem()
        item['url'] = url
        f = open("body.txt", "w")
        f.write(response.body)
        f.close()
        return item

    # if you want the start_urls pages to be parsed,
    # you need to tell CrawlSpider to do so by defining parse_start_url attribute
    # https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/crawl.py#L38
    parse_start_url = page_parse

---- ANSWER ----

Your login page is parsed by the parse_start_url method. You should redefine that method to parse the login page. Have a look at the documentation.
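
A minimal sketch of that suggestion, applied to the spider from the question (where start_urls is the login page); login_parse is the callback already defined there:

class PLMSpider(CrawlSpider):
    # name, allowed_domains, start_urls and rules as in the question ...

    # CrawlSpider routes responses for start_urls to parse_start_url
    # (which does nothing by default), so overriding it hands the
    # login page to the existing login_parse callback.
    def parse_start_url(self, response):
        return self.login_parse(response)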
