
Python - Scrapy - Creating a crawler that gets a list of URLs and crawls them

I am trying to create a spider with the package "Scrapy" that gets a list of URLs and crawls them. I have searched Stack Overflow for an answer but could not find anything that solves the issue.

My script is as follows:

import scrapy
from scrapy import Request


class Try(scrapy.Spider):
    name = "Try"

    def __init__(self, *args, **kwargs):
        super(Try, self).__init__(*args, **kwargs)
        self.start_urls = kwargs.get("urls")
        print(self.start_urls)

    def start_requests(self):
        print(self.start_urls)
        for url in self.start_urls:
            yield Request(url, self.parse)

    def parse(self, response):
        d = response.xpath("//body").extract()

When I crawl the spider:

from scrapy.crawler import CrawlerProcess

spider = Try(urls=[r"https://www.example.com"])
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(spider)
process.start()

I get the following info printed while printing self.start_urls:

  • In the __init__ method, the value printed is ["https://www.example.com"] (exactly as passed to the spider).
  • In the start_requests method, the value printed is None.

Why do I get None? Is there another way to approach this issue, or are there any mistakes in my spider class?

Thanks for any help given!

I would suggest using the spider class in process.crawl and passing the urls parameter there.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request


class Try(scrapy.Spider):
    name = 'Try'

    def __init__(self, *args, **kwargs):
        super(Try, self).__init__(*args, **kwargs)
        self.start_urls = kwargs.get("urls")

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, self.parse)

    def parse(self, response):
        d = response.xpath("//body").extract()

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(Try, urls=[r'https://www.example.com'])
process.start()

If I run

process.crawl(Try, urls=[r"https://www.example.com"])

then it sends urls to Try as I expect, and I don't even need start_requests.

import scrapy
from scrapy.crawler import CrawlerProcess


class Try(scrapy.Spider):
    name = "Try"

    def __init__(self, *args, **kwargs):
        super(Try, self).__init__(*args, **kwargs)
        self.start_urls = kwargs.get("urls")

    def parse(self, response):
        print('>>> url:', response.url)
        d = response.xpath("//body").extract()


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(Try, urls=[r"https://www.example.com"])
process.start()
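The reason start_requests can be dropped is that the base scrapy.Spider class already provides one that yields a request per entry in start_urls. Here is a rough pure-Python sketch of that default behavior (not Scrapy's exact implementation, and the Request and Spider classes below are simplified stand-ins):

```python
# Simplified stand-in for scrapy.Request: just records the URL and callback.
class Request:
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback


# Simplified stand-in for scrapy.Spider: the default start_requests
# iterates start_urls, so a subclass only needs to set that attribute.
class Spider:
    start_urls = []

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        raise NotImplementedError


class Try(Spider):
    def __init__(self, *args, **kwargs):
        self.start_urls = kwargs.get("urls") or []


spider = Try(urls=["https://www.example.com"])
requests = list(spider.start_requests())
print([r.url for r in requests])  # ['https://www.example.com']
```

So as long as start_urls is populated, the inherited start_requests does the iteration for you.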

But if I use

spider = Try(urls=["https://www.example.com"])

process.crawl(spider)

then it looks like Scrapy creates a fresh Try without urls, so the list ends up empty.
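A minimal pure-Python sketch (no Scrapy needed) of why passing an already-built instance loses the kwargs: the crawl machinery effectively works from the spider's class and builds a fresh instance itself, so arguments given to your own instance never reach the one that actually runs. The crawl function below is a hypothetical simplification of CrawlerProcess.crawl, not its real code:

```python
class Try:
    def __init__(self, *args, **kwargs):
        # Same pattern as the spider in the question.
        self.start_urls = kwargs.get("urls")


def crawl(crawler_or_spidercls, *args, **kwargs):
    # Simplified stand-in for CrawlerProcess.crawl: whether given a class
    # or an instance, it instantiates the spider itself, using only the
    # arguments passed to crawl() directly.
    if isinstance(crawler_or_spidercls, type):
        spidercls = crawler_or_spidercls
    else:
        spidercls = type(crawler_or_spidercls)
    return spidercls(*args, **kwargs)


# Passing an instance: its urls kwarg is silently discarded.
spider = Try(urls=["https://www.example.com"])
lost = crawl(spider)
print(lost.start_urls)   # None

# Passing the class plus kwargs: they reach the spider that runs.
kept = crawl(Try, urls=["https://www.example.com"])
print(kept.start_urls)   # ['https://www.example.com']
```

This matches the observed behavior: __init__ on your own instance prints the list, but the instance Scrapy actually runs was built without it, so start_requests sees None.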
