
Scrapy gets stuck with IIS 5.1 page

I'm writing spiders with Scrapy to get some data from a couple of ASP applications. Both webpages are almost identical and require logging in before the scraping can start, but I only managed to scrape one of them. On the other one, Scrapy waits forever and never gets past the login made with the FormRequest method.

The code of both spiders (they are almost identical, but with different IPs) is as follows:

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest
from scrapy.shell import inspect_response

class MySpider(BaseSpider):
    name = "my_very_nice_spider"
    allowed_domains = ["xxx.xxx.xxx.xxx"]
    start_urls = ['http://xxx.xxx.xxx.xxx/reporting/']

    def parse(self, response):
        # Simulate the user login on http://xxx.xxx.xxx.xxx/reporting/
        return [FormRequest.from_response(response,
                                          formdata={'user': 'the_username',
                                                    'password': 'my_nice_password'},
                                          callback=self.after_login)]

    def after_login(self, response):
        inspect_response(response, self)  # Spider never gets here on one of the sites
        if "Bad login" in response.body:
            print "Login failed"
            return
        # Scraping code begins...
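Incidentally, the field names passed in formdata have to match the name attributes of the login form's inputs; they can be double-checked interactively before blaming anything else. A minimal check from scrapy shell (the XPath is an assumption about the page layout, not something taken from the site):

scrapy shell http://xxx.xxx.xxx.xxx/reporting/
>>> from scrapy.selector import Selector
>>> Selector(response).xpath('//form//input/@name').extract()
# the list should include 'user' and 'password' for the formdata above to match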

Wondering what could be different between them, I used Firefox Live HTTP Headers to inspect the headers and found only one difference: the webpage that works runs IIS 6.0, and the one that doesn't runs IIS 5.1.

As this alone couldn't explain why one works and the other doesn't, I used Wireshark to capture the network traffic and found this:

Interaction using Scrapy with the working webpage (IIS 6.0):

scrapy  --> webpage GET /reporting/ HTTP/1.1
scrapy  <-- webpage HTTP/1.1 200 OK
scrapy  --> webpage POST /reporting/ HTTP/1.1 (application/x-www-form-urlencoded)
scrapy  <-- webpage HTTP/1.1 302 Object moved
scrapy  --> webpage GET /reporting/htm/webpage.asp
scrapy  <-- webpage HTTP/1.1 200 OK
scrapy  --> webpage POST /reporting/asp/report1.asp
...Scraping begins

Interaction using Scrapy with the non-working webpage (IIS 5.1):

scrapy  --> webpage GET /reporting/ HTTP/1.1
scrapy  <-- webpage HTTP/1.1 200 OK
scrapy  --> webpage POST /reporting/ HTTP/1.1 (application/x-www-form-urlencoded)
scrapy  <-- webpage HTTP/1.1 100 Continue # What the f...?
scrapy  <-- webpage HTTP/1.1 302 Object moved
...Scrapy waits forever...
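To confirm the server's behavior independently of Scrapy, the same POST can be replayed over a raw socket and the full response stream dumped. This is a minimal sketch; the host, path, and form fields are the placeholders from the question:

import socket

HOST = 'xxx.xxx.xxx.xxx'  # placeholder host, as in the question
BODY = 'user=the_username&password=my_nice_password'

request = ('POST /reporting/ HTTP/1.1\r\n'
           'Host: %s\r\n'
           'Content-Type: application/x-www-form-urlencoded\r\n'
           'Content-Length: %d\r\n'
           'Connection: close\r\n'
           '\r\n'
           '%s' % (HOST, len(BODY), BODY))

sock = socket.create_connection((HOST, 80), timeout=10)
sock.sendall(request)
raw = ''
while True:
    chunk = sock.recv(4096)
    if not chunk:
        break
    raw += chunk
sock.close()

# Against IIS 5.1 the capture above suggests this prints two status lines
# in a single stream: "HTTP/1.1 100 Continue" followed by
# "HTTP/1.1 302 Object moved".
print raw[:500]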

I googled a little bit and found that IIS 5.1 indeed has a nice kind of "feature" that makes it return an interim HTTP 100 Continue whenever someone makes a POST to it, as shown here. Scrapy's HTTP/1.1 download handler presumably mishandles that unsolicited interim response, which would explain why the real 302 never reaches the spider.

Knowing that the root of all evil is where it always is, but having to scrape that site anyway... how can I make Scrapy work in this situation? Or am I doing something wrong?

Thank you!

Edit - console log with the non-working site:

2014-01-17 09:09:50-0300 [scrapy] INFO: Scrapy 0.20.2 started (bot: mybot)
2014-01-17 09:09:50-0300 [scrapy] DEBUG: Optional features available: ssl, http11
2014-01-17 09:09:50-0300 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'bot.spiders', 'SPIDER_MODULES': ['bot.spiders'], 'BOT_NAME': 'bot'}
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Enabled item pipelines:
2014-01-17 09:09:51-0300 [my_very_nice_spider] INFO: Spider opened
2014-01-17 09:09:51-0300 [my_very_nice_spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-01-17 09:09:51-0300 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-01-17 09:09:54-0300 [my_very_nice_spider] DEBUG: Crawled (200) <GET http://xxx.xxx.xxx.xxx/reporting/> (referer: None)
2014-01-17 09:10:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:11:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:12:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:12:54-0300 [my_very_nice_spider] DEBUG: Retrying <POST http://xxx.xxx.xxx.xxx/reporting/> (failed 1 times): User timeout caused connection failure: Getting http://xxx.xxx.xxx.xxx/reporting/ took longer than 180 seconds..
2014-01-17 09:13:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:14:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:15:51-0300 [my_very_nice_spider] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-17 09:15:54-0300 [my_very_nice_spider] DEBUG: Retrying <POST http://xxx.xxx.xxx.xxx/reporting/> (failed 2 times): User timeout caused connection failure: Getting http://xxx.xxx.xxx.xxx/reporting/ took longer than 180 seconds..
...

Try using the HTTP 1.0 downloader. Since 100 Continue is an HTTP/1.1 mechanism, an HTTP/1.0 request shouldn't trigger it, and the login POST gets a single response that Scrapy can parse:

# settings.py
DOWNLOAD_HANDLERS = {
    'http': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http10.HTTP10DownloadHandler',
}
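Note that in Scrapy 0.20 DOWNLOAD_HANDLERS is a project-wide setting, so this switches every spider in the project to the HTTP/1.0 handler (which, among other things, does not use persistent connections). If other spiders need the HTTP/1.1 handler, the simplest split is to keep this spider in its own project, or to run it with a dedicated settings module.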
