
Scrapy: crawling start_urls causing issues

If I define the home page in my start_urls, Scrapy doesn't crawl the page and the "if" check in the parse_item function is never hit (e.g. 'someurl.com/medical/patient-info'). But when I provide that same page URL as the start URL (i.e. start_urls = 'someurl.com/medical/patient-info'), it crawls it and hits the check below in parse_item:

      from scrapy.spider import BaseSpider
      from scrapy.contrib.spiders.init import InitSpider
      from scrapy.http import Request, FormRequest
      from scrapy.selector import HtmlXPathSelector
      from tutorial.items import DmozItem
      from scrapy.contrib.spiders import CrawlSpider, Rule
      from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
      import urlparse
      from scrapy import log

      class MySpider(CrawlSpider):

          items = []
          failed_urls = []
          duplicate_responses = []

          name = 'myspiders'
          allowed_domains = ['someurl.com']
          login_page = 'someurl.com/login_form'
          start_urls = 'someurl.com/' # Facing problem for the url here

          rules = [Rule(SgmlLinkExtractor(deny=('logged_out', 'logout',)), follow=True, callback='parse_item')]

          def start_requests(self):

              yield Request(
                  url=self.login_page,
                  callback=self.login,
                  dont_filter=False
                  )


          def login(self, response):
              """Generate a login request."""
              return FormRequest.from_response(response,
                formnumber=1,
                formdata={'username': 'username', 'password': 'password' },
                callback=self.check_login_response)


          def check_login_response(self, response):
              """Check the response returned by a login request to see if we are
              successfully logged in.
              """
              if "Logout" in response.body:
                  self.log("Successfully logged in. Let's start crawling! :%s" % response, level=log.INFO)
                  self.log("Response Url : %s" % response.url, level=log.INFO)

                  return Request(url=self.start_urls)
              else:
                  self.log("Bad times :(", level=log.INFO)


          def parse_item(self, response):


              # Scrape data from page
              hxs = HtmlXPathSelector(response)

              self.log('response came in from : %s' % (response), level=log.INFO)

              # check for some important page to crawl
              if response.url == 'someurl.com/medical/patient-info' :

                  self.log('yes I am here', level=log.INFO)

                  urls = hxs.select('//a/@href').extract()
                  urls = list(set(urls))


                  for url in urls :

                      self.log('URL extracted : %s' % url, level=log.INFO)

                      item = DmozItem()

                      if response.status == 404 or response.status == 500:
                          self.failed_urls.append(response.url)
                          self.log('failed_url : %s' % self.failed_urls, level=log.INFO)
                          item['failed_urls'] = self.failed_urls

                      else :

                          if url.startswith('http') :
                              if url.startswith('someurl.com'):
                                  item['internal_link'] = url
                                  self.log('internal_link :%s' % url, level=log.INFO)
                              else :
                                  item['external_link'] = url
                                  self.log('external_link :%s' % url, level=log.INFO)

                      self.items.append(item)

                  self.items = list(set(self.items))
                  return self.items
              else :
                  self.log('did not receive expected response', level=log.INFO)

I guess start_urls has to be a list.

Try the following: start_urls = ['http://www.someurl.com/', ]
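
The underlying issue: Scrapy expects start_urls to be an iterable of absolute URLs, so code that loops over it (such as Scrapy's default start_requests) would walk a bare string character by character, and Request(url=...) takes a single URL string with a scheme, not a list. A minimal sketch of the corrected pieces, keeping the hypothetical someurl.com domain from the question:

      # Sketch of the fix, reusing the question's hypothetical domain.
      # start_urls must be a list of absolute URLs, scheme included.
      start_urls = ['http://www.someurl.com/']

      def check_login_response(self, response):
          """After a successful login, kick off the crawl proper."""
          if "Logout" in response.body:
              self.log("Successfully logged in: %s" % response, level=log.INFO)
              # Request() takes one URL string, so iterate the list
              # instead of passing the whole list as the url argument.
              for url in self.start_urls:
                  yield Request(url=url)
          else:
              self.log("Bad times :(", level=log.INFO)

With the list in place, the post-login request goes through CrawlSpider's default parse, the rules follow links from the home page, and parse_item should then fire for pages like someurl.com/medical/patient-info.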
