
Scrapy gets stuck with IIS 5.1 page

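I'm writing spiders with Scrapy to fetch some data from a couple of applications that use ASP. The two web pages are almost identical and both require a login before scraping can start, but I have only managed to scrape one of them. With the other, Scrapy waits forever for something and never reaches the callback after logging in with the FormRequest method.

The code for both spiders (they are nearly identical, just with different IPs) is as follows: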

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest
from scrapy.shell import inspect_response

class MySpider(BaseSpider):
    name = "my_very_nice_spider"
    allowed_domains = ["xxx.xxx.xxx.xxx"]
    start_urls = ['http://xxx.xxx.xxx.xxx/reporting/']

    def parse(self, response):
        # Simulate user login on (http://xxx.xxx.xxx.xxx/reporting/)
        return [FormRequest.from_response(response,
                                          formdata={'user': 'the_username',
                                                    'password': 'my_nice_password'},
                                          callback=self.after_login)]

    def after_login(self, response):
        inspect_response(response, self)  # Spider never gets here in one site
        if "Bad login" in response.body:
            print("Login failed")
            return
        # Scraping code begins...

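I wondered what could be different between them, so I inspected the headers with Firefox Live HTTP Headers and found only one difference: the web page that works runs on IIS 6.0, and the one that doesn't runs on IIS 5.1.

Since that alone couldn't explain why one works and the other doesn't, I captured the network traffic with Wireshark and found this:

Interaction using scrapy with the working web page (IIS 6.0)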

scrapy  --> webpage GET /reporting/ HTTP/1.1
scrapy  <-- webpage HTTP/1.1 200 OK
scrapy  --> webpage POST /reporting/ HTTP/1.1 (application/x-www-form-urlencoded)
scrapy  <-- webpage HTTP/1.1 302 Object moved
scrapy  --> webpage GET /reporting/htm/webpage.asp
scrapy  <-- webpage HTTP/1.1 200 OK
scrapy  --> webpage POST /reporting/asp/report1.asp
...Scraping begins

Interaction using scrapy with the non-working web page (IIS 5.1)

scrapy  --> webpage GET /reporting/ HTTP/1.1
scrapy  <-- webpage HTTP/1.1 200 OK
scrapy  --> webpage POST /reporting/ HTTP/1.1 (application/x-www-form-urlencoded)
scrapy  <-- webpage HTTP/1.1 100 Continue # What the f...?
scrapy  <-- webpage HTTP/1.1 302 Object moved
...Scrapy waits forever...

I googled a bit and found that IIS 5.1 does indeed have a nice kind of "feature": it returns HTTP 100 whenever someone makes a POST to it.
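For reference, HTTP/1.1 treats 100 Continue as an interim response: a client is expected to be able to parse it and keep reading until the final status arrives. The sketch below illustrates that rule on bytes modelled on the IIS 5.1 trace above. The function name `read_final_response_head`, the sample bytes, and the `Location` value are assumptions made for illustration (the redirect target is borrowed from the IIS 6.0 trace); this is not Scrapy's own parsing code.

```python
def read_final_response_head(raw):
    """Return (status, reason, headers) of the first non-interim response in raw."""
    rest = raw
    while True:
        head, sep, rest = rest.partition(b"\r\n\r\n")
        if not sep:
            raise ValueError("incomplete response head")
        lines = head.decode("iso-8859-1").split("\r\n")
        _version, _, tail = lines[0].partition(" ")
        code, _, reason = tail.partition(" ")
        status = int(code)
        if 100 <= status < 200 and status != 101:
            # Interim response (e.g. 100 Continue): it carries no body,
            # so the next response head follows immediately.
            continue
        headers = {}
        for line in lines[1:]:
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
        return status, reason, headers


# Hypothetical bytes shaped like the IIS 5.1 exchange in the capture.
IIS51_REPLY = (
    b"HTTP/1.1 100 Continue\r\n"
    b"\r\n"
    b"HTTP/1.1 302 Object moved\r\n"
    b"Location: /reporting/htm/webpage.asp\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
)

status, reason, headers = read_final_response_head(IIS51_REPLY)
print(status, reason, headers["location"])
```

A client that stops at the first status line, or that does not expect a second response head on the same request, would never surface the 302 to the spider, which matches the hang seen in the capture.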

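I know where the root of all evil is, but I have to scrape that site anyway... How can I make scrapy work in this situation? Or am I doing something wrong?

Thank you!

Edit - console log with the non-working site: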

2014-01-17 09:09:50-0300 [scrapy] INFO: Scrapy 0.20.2 started (bot: mybot)
2014-01-17 09:09:50-0300 [scrapy] DEBUG: Optional features available: ssl, http11
2014-01-17 09:09:50-0300 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'bot.spiders', 'SPIDER_MODULES': ['bot.spiders'], 'BOT_NAME': 'bot'}
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled item pipelines:
2014-01-17 09:09:51-0300 [my_very_nice_spider] INFO: Spider opened
2014-01-17 09:09:51-0300 [my_very_nice_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-01-17 09:09:54-0300 [my_very_nice_spider] DEBUG: Crawled (200) <GET http://xxx.xxx.xxx.xxx/reporting/> (referer: None)
2014-01-17 09:10:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:11:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:12:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:12:54-0300 [my_very_nice_spider] DEBUG: Retrying <POST http://xxx.xxx.xxx.xxx/reporting/> (failed 1 times): User timeout caused connection failure: Getting http://xxx.xxx.xxx.xxx/reporting/ took longer than 180 seconds..
2014-01-17 09:13:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:14:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:15:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:15:54-0300 [my_very_nice_spider] DEBUG: Retrying <POST http://xxx.xxx.xxx.xxx/reporting/> (failed 2 times): User timeout caused connection failure: Getting http://xxx.xxx.xxx.xxx/reporting/ took longer than 180 seconds..
...
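The log shows each attempt using up the full 180-second download timeout before the POST is retried. While investigating a hang like this, shorter limits make every experiment quicker. A possible debugging fragment follows; `DOWNLOAD_TIMEOUT` and `RETRY_TIMES` are standard Scrapy settings, but the values 15 and 0 are arbitrary assumptions, not recommendations.

```python
# settings.py (debugging only; the values below are illustrative assumptions)
DOWNLOAD_TIMEOUT = 15  # seconds to wait for a response; the log shows 180 in effect
RETRY_TIMES = 0        # do not retry the stalled POST while investigating
```

These only shorten the feedback loop. They do not change how the 100 Continue reply is handled.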

Try using the HTTP 1.0 downloader:

# settings.py
DOWNLOAD_HANDLERS = {
    'http': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
}
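A likely reason this helps, stated as an assumption rather than something confirmed in the thread: 1xx interim responses exist only in HTTP/1.1, and a server is not supposed to send them to an HTTP/1.0 client. To check how a given client copes with an unsolicited 100 Continue without touching the real site, a local stand-in server can replay the sequence from the capture. The sketch below uses only the Python 3 standard library; `serve_once`, `post_login`, the credentials, and the `Location` value are illustrative names and data, and `http.client` stands in for whichever client is under test.

```python
import http.client
import socket
import threading


def serve_once(listener):
    """Accept one connection and answer like the IIS 5.1 trace: 100, then 302."""
    conn, _ = listener.accept()
    with conn:
        conn.settimeout(5)
        data = b""
        while b"\r\n\r\n" not in data:  # read the request head
            chunk = conn.recv(4096)
            if not chunk:
                return
            data += chunk
        head, _, body = data.partition(b"\r\n\r\n")
        length = 0
        for line in head.split(b"\r\n")[1:]:
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip().decode("ascii"))
        while len(body) < length:  # drain the form body before replying
            chunk = conn.recv(4096)
            if not chunk:
                break
            body += chunk
        conn.sendall(b"HTTP/1.1 100 Continue\r\n\r\n")
        conn.sendall(
            b"HTTP/1.1 302 Object moved\r\n"
            b"Location: /reporting/htm/webpage.asp\r\n"
            b"Content-Length: 0\r\n"
            b"Connection: close\r\n"
            b"\r\n"
        )


def post_login(host, port):
    """POST a login form and return (status, Location) of the final response."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    conn.request(
        "POST",
        "/reporting/",
        body="user=the_username&password=my_nice_password",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    resp = conn.getresponse()
    resp.read()
    result = (resp.status, resp.getheader("Location"))
    conn.close()
    return result


listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]
thread = threading.Thread(target=serve_once, args=(listener,), daemon=True)
thread.start()
status, location = post_login("127.0.0.1", port)
thread.join(timeout=5)
listener.close()
print(status, location)
```

Pointing the spider's `start_urls` at a stand-in of this kind (extended to serve the login form as well) would show whether a given download handler reaches `after_login` or stalls as in the log.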
