Scrapy crawl a local website by IP address
I'm still experimenting with Scrapy, and I'm trying to crawl a website on my local network. The website has the IP address 192.168.0.185. This is my spider:
from scrapy.spider import BaseSpider

class 192.168.0.185_Spider(BaseSpider):
    name = "192.168.0.185"
    allowed_domains = ["192.168.0.185"]
    start_urls = ["http://192.168.0.185/"]

    def parse(self, response):
        print "Test:", response.headers
And then in the same directory as my spider I'd execute this shell command to run the spider:
scrapy crawl 192.168.0.185
And I get a very ugly, unreadable error message:
2012-02-10 20:55:18-0600 [scrapy] INFO: Scrapy 0.14.0 started (bot: tutorial)
2012-02-10 20:55:18-0600 [scrapy] DEBUG: Enabled extensions: LogStats,
TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-02-10 20:55:18-0600 [scrapy] DEBUG: Enabled downloader middlewares:
HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware,
DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware,
HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-02-10 20:55:18-0600 [scrapy] DEBUG: Enabled spider middlewares:
HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware,
DepthMiddleware
2012-02-10 20:55:18-0600 [scrapy] DEBUG: Enabled item pipelines:
Traceback (most recent call last): File "/usr/bin/scrapy", line 5, in <module>
pkg_resources.run_script('Scrapy==0.14.0', 'scrapy')
File "/usr/lib/python2.6/site-packages/pkg_resources.py", line 467, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python2.6/site-packages/pkg_resources.py", line 1200, in run_script
execfile(script_filename, namespace, namespace)
File "/usr/lib/python2.6/site-packages/Scrapy-0.14.0-py2.6.egg/EGG-INFO/scripts
/scrapy", line 4, in <module>
execute()
File "/usr/lib/python2.6/site-packages/Scrapy-0.14.0-py2.6.egg/scrapy/cmdline.py",
line 132, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/usr/lib/python2.6/site-packages/Scrapy-0.14.0-py2.6.egg/scrapy/cmdline.py",
line 97, in _run_print_help func(*a, **kw)
File "/usr/lib/python2.6/site-packages/Scrapy-0.14.0-py2.6.egg/scrapy/cmdline.py",
line 139, in _run_command cmd.run(args, opts)
File "/usr/lib/python2.6/site-packages/Scrapy-0.14.0-py2.6.egg/scrapy/commands
/crawl.py", line 43, in run
spider = self.crawler.spiders.create(spname, **opts.spargs)
File "/usr/lib/python2.6/site-packages/Scrapy-0.14.0-py2.6.egg/scrapy
/spidermanager.py", line 43, in create
raise KeyError("Spider not found: %s" % spider_name)
KeyError: 'Spider not found: 192.168.0.185'
So then I made another spider, which is practically the same as the first one, except it uses a domain name rather than an IP address. This one worked just fine. Does anyone know what the deal is? How can I get Scrapy to crawl a website via IP address as opposed to a domain name?
from scrapy.spider import BaseSpider

class facebook_Spider(BaseSpider):
    name = "facebook"
    allowed_domains = ["facebook.com"]
    start_urls = ["http://www.facebook.com/"]

    def parse(self, response):
        print "Test:", response.headers
class 192.168.0.185_Spider(BaseSpider):
    ...
In Python, you can't use a class name that begins with a digit or contains dots. See the documentation on Identifiers and keywords.
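The identifier rule can be checked directly in Python 3, which exposes `str.isidentifier()` (the original post runs Python 2.6, but the same naming rule applies there); this is a minimal illustration, not part of the original answer:

```python
# str.isidentifier() reports whether a string is a legal Python identifier.
# The class name from the question fails: it starts with a digit and contains dots.
print("192.168.0.185_Spider".isidentifier())  # False
# Replacing dots with underscores and leading with a letter makes it legal.
print("Spider_192_168_0_185".isidentifier())  # True
```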
You can create this spider with a correct name:
$ scrapy startproject testproj
$ cd testproj
$ scrapy genspider testspider 192.168.0.185
Created spider 'testspider' using template 'crawl' in module:
testproj.spiders.testspider
The spider definition will look like this:
class TestspiderSpider(CrawlSpider):
    name = 'testspider'
    allowed_domains = ['192.168.0.185']
    start_urls = ['http://www.192.168.0.185/']
    ...
And you should probably delete the www from start_urls, since http://www.192.168.0.185/ is not a valid URL. To start crawling, use the spider name instead of the host:
$ scrapy crawl testspider
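The key distinction is that the class name must be a valid identifier, while the `name` attribute is an ordinary string, so it may contain dots and digits freely; `scrapy crawl` matches on the string, not the class name. A minimal sketch of that distinction, using a stand-in stub instead of the real `scrapy.spider.BaseSpider` (the class and spider names here are hypothetical):

```python
# Stand-in stub for scrapy's BaseSpider, just to illustrate the naming rules.
class BaseSpider(object):
    name = None

# The class name obeys Python identifier rules (no leading digit, no dots),
# but the `name` attribute is a plain string and may hold the IP address.
class LocalSiteSpider(BaseSpider):
    name = "192.168.0.185"
    allowed_domains = ["192.168.0.185"]
    start_urls = ["http://192.168.0.185/"]

print(LocalSiteSpider.name)  # "192.168.0.185" -- what `scrapy crawl` matches on
```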