
Python Scrapy - Not able to crawl

I am trying to crawl some websites using Scrapy. Below is a sample code. The parse method is not getting called. I am trying to run the code through a reactor service (code provided), so I run it from startCrawling.py, which has the reactor. I know that I am missing something. Could you please help out?

Thanks,

Code - categorization.py

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from scrapy.selector import Selector
from scrapy.selector import HtmlXPathSelector
from items.items import CategorizationItem
from scrapy.contrib.spiders.crawl import CrawlSpider
class TestingSpider(CrawlSpider):
    print 'in spider'
    name = 'testSpider'
    allowed_domains = ['wikipedia.org']
    start_urls = ['http://www.wikipedia.org']

    def parse(self, response):
        # Scrape data from page
        print 'here'
        open('test.html', 'wb').write(response.body)

Code - startCrawling.py

 from twisted.internet import reactor
 from scrapy.crawler import Crawler
 from scrapy.settings import Settings
 from scrapy import log, signals
 from scrapy.xlib.pydispatch import dispatcher
 from scrapy.utils.project import get_project_settings

 from spiders.categorization import TestingSpider

 # Scrapy spiders script...

 def stop_reactor():
     reactor.stop()  #@UndefinedVariable
     print 'hi'

 dispatcher.connect(stop_reactor, signal=signals.spider_closed)
 spider = TestingSpider()
 crawler = Crawler(Settings())
 crawler.configure()
 crawler.crawl(spider)
 crawler.start()
 reactor.run()  #@UndefinedVariable

You are not supposed to override the parse() method when using CrawlSpider. You should set a custom callback in your Rule with a different name.
Here is the excerpt from the official documentation:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
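
For illustration, here is a minimal sketch of the spider rewritten that way, reusing the old scrapy.contrib imports from the question; the callback name parse_page and the empty allow pattern are arbitrary choices for the example, not something mandated by Scrapy:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TestingSpider(CrawlSpider):
    name = 'testSpider'
    allowed_domains = ['wikipedia.org']
    start_urls = ['http://www.wikipedia.org']

    # Route extracted links to a custom callback instead of overriding parse()
    rules = (
        Rule(SgmlLinkExtractor(allow=()), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # Scrape data from the page
        open('test.html', 'wb').write(response.body)

With this, CrawlSpider's built-in parse() stays intact and dispatches the responses matched by the rules to parse_page.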
