
Scrapy: start crawling after login

Disclaimer: The site I am crawling is a corporate intranet, and I have modified the URLs a bit for corporate privacy.

I managed to log into the site, but I have failed to crawl it.

Starting from start_url https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf (this site redirects you to a similar page with a more complex URL, i.e.

https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument {unid=ADE682E34FC59D274825770B0037D278}),

for every page, including the start_url, I want to crawl every href found under //li/<a>. (For every page crawled there is an abundant number of hyperlinks available, and some of them are duplicates, because you can reach both the parent and the child sites from the same page.)

[Screenshot of the extracted href][1]

As you may see, the href does not contain the actual link (the one quoted above) that we see when we crawl into that page. There is also a # in front of its useful content. Could that be the source of the problem?
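One quick way to check is to resolve the extracted href against the page URL in scrapy shell; a fragment-only href (one that starts with #) resolves back to the current page plus a fragment, so the link extractor has nothing new to follow from it. This is only a sketch, using the //li/a XPath from above and an illustrative href value:

from urlparse import urljoin  # Python 2, matching the traceback below; use urllib.parse on Python 3

# Run inside `scrapy shell <page url>`: see what an extracted href resolves to.
# A fragment-only href such as '#h_RoomHome...' resolves to the current page
# plus a fragment, which is not a new page to follow.
href = response.xpath('//li/a/@href').extract()[0]
print urljoin(response.url, href)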

For restrict_xpaths, I have restricted the paths to the 'logout' link of the page.
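Note that restrict_xpaths tells the LinkExtractor which regions of the page to extract links from, not which links to skip. To extract every link under //li/a while skipping the logout link, something along these lines would be closer; the deny pattern is an assumption, not taken from the real site:

from scrapy.contrib.spiders import Rule
from scrapy.linkextractors import LinkExtractor

# Sketch to drop into the spider class: extract links from the //li/a elements
# and follow them, skipping anything that looks like a logout URL.
# The 'Logout' pattern is an assumed example.
rules = (
    Rule(LinkExtractor(restrict_xpaths=('//li/a',),
                       deny=(r'Logout',),
                       unique=True),
         callback='parse_item', follow=True),
)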

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.http import Request, FormRequest
from scrapy.linkextractors import LinkExtractor
import scrapy

class kmssSpider(CrawlSpider):
    name='kmss'
    start_url = ('https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf',)
    login_page = 'https://kmssqkr.ccgo.sarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login'
    allowed_domain = ["kmssqkr.sarg"]

    rules = (
        Rule(LinkExtractor(allow=(r'https://kmssqkr.sarg/LotusQuickr/dept/\w*'),
                           restrict_xpaths=('//*[@id="quickr_widgets_misc_loginlogout_0"]/a'),
                           unique=True),
             callback='parse_item', follow=True),
    )
#    r"LotusQuickr/dept/^[ A-Za-z0-9_@./#&+-]*$"
#    restrict_xpaths=('//*[@id="quickr_widgets_misc_loginlogout_0"]/a'),unique = True)

    def start_requests(self):
        yield Request(url=self.login_page, callback=self.login, dont_filter=True)

    def login(self, response):
        return FormRequest.from_response(response,
                                         formdata={'user': 'user', 'password': 'pw'},
                                         callback=self.check_login_response)

    def check_login_response(self,response):
        if 'Welcome' in response.body:
            self.log("\n\n\n\n Successfuly Logged in \n\n\n ")
            yield Request(url=self.start_url[0])
        else:
            self.log("\n\n You are not logged in \n\n " )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        pass

Log:

2015-07-27 16:46:18 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-27 16:46:18 [boto] DEBUG: Retrieving credentials from metadata server.
2015-07-27 16:46:19 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\site-packages\boto\utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Users\hi\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2015-07-27 16:46:19 [boto] ERROR: Unable to read instance data, giving up
2015-07-27 16:46:19 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-27 16:46:19 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-27 16:46:19 [scrapy] INFO: Enabled item pipelines: 
2015-07-27 16:46:19 [scrapy] INFO: Spider opened
2015-07-27 16:46:19 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-27 16:46:19 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-27 16:46:24 [scrapy] DEBUG: Crawled (200) <GET https://kmssqkr.ccgo.sarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login> (referer: None)
2015-07-27 16:46:28 [scrapy] DEBUG: Crawled (200) <POST https://kmssqkr.ccgo.sarg/names.nsf?Login> (referer: https://kmssqkr.ccgo.sarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login)
2015-07-27 16:46:29 [kmss] DEBUG: 



 Successfuly Logged in 



2015-07-27 16:46:29 [scrapy] DEBUG: Redirecting (302) to <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_Toc/d0a58cff88e9100b852572c300517498/?OpenDocument> from <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf>
2015-07-27 16:46:29 [scrapy] DEBUG: Redirecting (302) to <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument> from <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_Toc/d0a58cff88e9100b852572c300517498/?OpenDocument>
2015-07-27 16:46:29 [scrapy] DEBUG: Crawled (200) <GET https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument> (referer: https://kmssqkr.sarg/names.nsf?Login)
2015-07-27 16:46:29 [scrapy] INFO: Closing spider (finished)
2015-07-27 16:46:29 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1954,
 'downloader/request_count': 5,
 'downloader/request_method_count/GET': 4,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 31259,
 'downloader/response_count': 5,
 'downloader/response_status_count/200': 3,
 'downloader/response_status_count/302': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 27, 8, 46, 29, 286000),
 'log_count/DEBUG': 8,
 'log_count/ERROR': 2,
 'log_count/INFO': 7,
 'log_count/WARNING': 1,
 'request_depth_max': 2,
 'response_received_count': 3,
 'scheduler/dequeued': 5,
 'scheduler/dequeued/memory': 5,
 'scheduler/enqueued': 5,
 'scheduler/enqueued/memory': 5,
 'start_time': datetime.datetime(2015, 7, 27, 8, 46, 19, 528000)}
2015-07-27 16:46:29 [scrapy] INFO: Spider closed (finished)

  [1]: http://i.stack.imgur.com/REQXJ.png

---------------------------------- UPDATED ----------------------------------

I saw the cookies format in http://doc.scrapy.org/en/latest/topics/request-response.html. These are my cookies on the site, but I am not sure which ones I should add, and how to add them, along with the Request.

[Screenshot of the site's cookies]
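For reference, the Request documented at that link accepts a cookies argument as a dict (or as a list of dicts with name/value/domain/path keys). A minimal sketch with a hypothetical callback name and placeholder cookie names and values, not the real ones from the screenshot:

from scrapy.http import Request

# Hypothetical method name and placeholder cookies, for illustration only.
def after_login(self, response):
    return Request(
        url='https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf',
        cookies={'SessionID': 'placeholder', 'LtpaToken': 'placeholder'},
        callback=self.parse_item,
        dont_filter=True,
    )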

---------------------------------- ANSWER ----------------------------------

First of all, do not be demanding; sometimes I get angry and won't answer your question.

To see which cookies are sent with your Request, enable debugging with COOKIES_DEBUG = True.
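These are standard Scrapy settings, e.g. in settings.py:

# settings.py -- have CookiesMiddleware log every cookie it sends and receives
COOKIES_ENABLED = True   # the default, shown here for clarity
COOKIES_DEBUG = True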

Then you will notice that cookies are not sent, even though Scrapy's middleware should send them. I think this is because you yield a custom Request, and Scrapy won't try to be cleverer than you: it accepts your decision and sends this request without cookies.

This means you need to access the cookies from the response and add the required ones (or all of them) to your Request.
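A minimal sketch of what that could look like in the check_login_response callback above, assuming the session cookies arrive as Set-Cookie headers on the login response (the header parsing here is deliberately simplified):

from scrapy.http import Request

def check_login_response(self, response):
    if 'Welcome' in response.body:
        self.log("Successfully logged in")
        # Collect the cookies set by the login response and attach them
        # explicitly to the next request.
        cookies = {}
        for header in response.headers.getlist('Set-Cookie'):
            name, _, value = header.split(';', 1)[0].partition('=')
            cookies[name.strip()] = value.strip()
        yield Request(url=self.start_url[0], cookies=cookies, dont_filter=True)
    else:
        self.log("You are not logged in")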
