
Scrapy gets stuck with IIS 5.1 page

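I'm writing spiders with Scrapy to get some data from a couple of applications that use ASP. The two web pages are almost identical and both require logging in before scraping can start, but I have only managed to scrape one of them. On the other one, Scrapy waits forever for something and never gets past the login made with the FormRequest method.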


from scrapy.spider import BaseSpider
from scrapy.http import FormRequest
from scrapy.shell import inspect_response

class MySpider(BaseSpider):
    name = "my_very_nice_spider"
    allowed_domains = ["xxx.xxx.xxx.xxx"]
    start_urls = ['http://xxx.xxx.xxx.xxx/reporting/']

    def parse(self, response):
        # Simulate user login on (http://xxx.xxx.xxx.xxx/reporting/)
        return [FormRequest.from_response(
            response,
            # The rest of this call is cut off in the source: the formdata
            # argument with the login fields is not shown, and the callback
            # is inferred from the method defined below.
            callback=self.after_login)]

    def after_login(self, response):
        inspect_response(response, self)  # Spider never gets here on one site
        if "Bad login" in response.body:
            print "Login failed"
        # Scraping code begins...

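I wondered what could be different between them, so I inspected the headers with Firefox Live HTTP Headers and found only one difference: the web page that works uses IIS 6.0, and the one that doesn't uses IIS 5.1.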


Scrapy interaction with the working web page (IIS 6.0)

scrapy  --> webpage GET /reporting/ HTTP/1.1
scrapy  <-- webpage HTTP/1.1 200 OK
scrapy  --> webpage POST /reporting/ HTTP/1.1 (application/x-www-form-urlencoded)
scrapy  <-- webpage HTTP/1.1 302 Object moved
scrapy  --> webpage GET /reporting/htm/webpage.asp
scrapy  <-- webpage HTTP/1.1 200 OK
scrapy  --> webpage POST /reporting/asp/report1.asp
...Scraping begins

Scrapy interaction with the non-working web page (IIS 5.1)

scrapy  --> webpage GET /reporting/ HTTP/1.1
scrapy  <-- webpage HTTP/1.1 200 OK
scrapy  --> webpage POST /reporting/ HTTP/1.1 (application/x-www-form-urlencoded)
scrapy  <-- webpage HTTP/1.1 100 Continue # What the f...?
scrapy  <-- webpage HTTP/1.1 302 Object moved
...Scrapy waits forever...

I googled a bit and found that IIS 5.1 does have a nice "feature": it returns HTTP 100 whenever someone makes a POST to it.
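To make the exchange above concrete, here is a minimal sketch of how a client can skip interim 1xx responses and reach the final one. It is not how Scrapy or Twisted parse responses; the function name `split_final_response` and the sample bytes (including the `Location` value) are assumptions made up for illustration, modelled on the trace above.

```python
def split_final_response(raw):
    """Skip interim 1xx responses; return (status, reason, headers, rest)."""
    rest = raw
    while True:
        # Each response head ends with an empty line (CRLF CRLF).
        head, sep, rest = rest.partition(b"\r\n\r\n")
        if not sep:
            raise ValueError("incomplete response head")
        lines = head.split(b"\r\n")
        parts = lines[0].split(b" ", 2)  # version, status code, reason phrase
        status = int(parts[1])
        reason = parts[2].decode("latin-1") if len(parts) > 2 else ""
        if 100 <= status < 200 and status != 101:
            # Interim response such as "100 Continue": the final response
            # follows on the same connection, so keep reading.
            continue
        headers = {}
        for line in lines[1:]:
            name, _, value = line.partition(b":")
            headers[name.strip().lower().decode("ascii")] = value.strip().decode("latin-1")
        return status, reason, headers, rest


# Hypothetical capture resembling the IIS 5.1 exchange (not real server output).
raw = (
    b"HTTP/1.1 100 Continue\r\n"
    b"Server: Microsoft-IIS/5.1\r\n"
    b"\r\n"
    b"HTTP/1.1 302 Object moved\r\n"
    b"Location: /reporting/htm/webpage.asp\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
)
status, reason, headers, body = split_final_response(raw)
print(status, reason, headers["location"])
```

A client that treats the `100 Continue` head as the whole response never sees the 302 that follows it, which would be consistent with the hang shown in the trace.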

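I know where the root of all evil lies, as always, but I have to scrape that site anyway... How can I make Scrapy work in this situation? Or am I doing something wrong?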



2014-01-17 09:09:50-0300 [scrapy] INFO: Scrapy 0.20.2 started (bot: mybot)
2014-01-17 09:09:50-0300 [scrapy] DEBUG: Optional features available: ssl, http11
2014-01-17 09:09:50-0300 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'bot.spiders', 'SPIDER_MODULES': ['bot.spiders'], 'BOT_NAME': 'bot'}
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled item pipelines:
2014-01-17 09:09:51-0300 [my_very_nice_spider] INFO: Spider opened
2014-01-17 09:09:51-0300 [my_very_nice_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Telnet console listening on
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Web service listening on
2014-01-17 09:09:54-0300 [my_very_nice_spider] DEBUG: Crawled (200) <GET http://xxx.xxx.xxx.xxx/reporting/> (referer: None)
2014-01-17 09:10:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:11:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:12:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:12:54-0300 [my_very_nice_spider] DEBUG: Retrying <POST http://xxx.xxx.xxx.xxx/reporting/> (failed 1 times): User timeout caused connection failure: Getting http://xxx.xxx.xxx.xxx/reporting/ took longer than 180 seconds..
2014-01-17 09:13:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:14:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:15:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:15:54-0300 [my_very_nice_spider] DEBUG: Retrying <POST http://xxx.xxx.xxx.xxx/reporting/> (failed 2 times): User timeout caused connection failure: Getting http://xxx.xxx.xxx.xxx/reporting/ took longer than 180 seconds..
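The 180 seconds in the log match Scrapy's default `DOWNLOAD_TIMEOUT`, and each retry waits that long again. While debugging a hang like this, a shorter timeout and fewer retries may make each attempt fail faster. The values below are illustrative assumptions, not recommendations:

```python
# settings.py (debugging sketch; the values are illustrative assumptions)

# Give up on a request after 15 seconds instead of Scrapy's default 180.
DOWNLOAD_TIMEOUT = 15

# Retry a failed request once instead of the default 2 times.
RETRY_TIMES = 1
```

This does not fix the underlying problem; it only shortens the wait between attempts.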

Try using the HTTP 1.0 downloader:

# settings.py
DOWNLOAD_HANDLERS = {
    'http': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
}


