import scrapy
from scrapy.spiders.crawl import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'genericSpider'
    allowed_domains = ['example.com']
    start_urls = [url_1, url_2, url_3]

    rules = [
        Rule(
            LinkExtractor(),
            callback='parse',
            follow=True
        ),
    ]

    def parse(self, response):
        hxs = scrapy.Selector(response)
        links = hxs.xpath('*//a/@href').extract()
        for link in links:
            print(link)
            print()
I'm attempting to crawl a website. To keep the example simple, my code just extracts all links and prints them to the terminal.
This works fine for the URLs in start_urls, but the spider doesn't seem to crawl the extracted URLs.
That is the point of CrawlSpider, correct? Visit a page, collect its links, and visit all of those links until it runs out of them?
I've been stuck for a few days; any help would be great.
The problem is that you named your method parse. As per the documentation, that name should be avoided when using CrawlSpider, because CrawlSpider implements its own parse method to drive the rule-based crawling, and overriding it breaks the link-following logic. Just rename the method to, for example, parse_link (and adjust the callback argument in the Rule accordingly) and it will work.
Also, remember that the allowed_domains attribute must match the domains of the URLs you intend to crawl.
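For illustration, here is a minimal sketch of the spider with the callback renamed, assuming the same rule; the start URL is a placeholder on example.com, and response.xpath is used instead of constructing a Selector by hand (the original Selector approach also works):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'genericSpider'
    # must cover the domains of the URLs you want the spider to follow
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/']  # placeholder start URL

    rules = [
        Rule(
            LinkExtractor(),
            callback='parse_link',  # renamed so CrawlSpider's own parse() stays intact
            follow=True,
        ),
    ]

    def parse_link(self, response):
        # response.xpath() works directly on the response; no explicit Selector needed
        for link in response.xpath('//a/@href').extract():
            print(link)

With parse_link as the callback, CrawlSpider's built-in parse method stays in charge of applying the Rule, so the extracted links are scheduled and crawled as expected.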