I'm very new to Python and Scrapy and decided to try to build a spider instead of just being scared of the new, challenging-looking language.
So this is my first spider; its purpose is to crawl a site and record every page's links together with their HTTP status codes.
The problem I'm encountering is that it never stops crawling; it seems stuck in a loop, re-crawling every page more than once...
What did I do wrong? (Obviously many things, since I've never written Python code before, but still.) How can I make the spider crawl every page only once?
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import urllib.parse
import requests

class TestSpider(CrawlSpider):
    name = "test"
    allowed_domains = ["cerve.co"]
    start_urls = ["https://cerve.co"]
    rules = [Rule(LinkExtractor(allow=['.*'], tags=('a',), attrs=('href',)),
                  callback='parse_item', follow=True)]

    def parse_item(self, response):
        # collect every <a href> on the page and yield one item per link
        alllinks = response.css('a::attr(href)').getall()
        for link in alllinks:
            link = response.urljoin(link)
            yield {
                'page': urllib.parse.unquote(response.url),
                'links': urllib.parse.unquote(link),
                'number of links': len(alllinks),
                # blocking request just to record each link's HTTP status
                'status': requests.get(link).status_code,
            }
Scrapy said: "By default, Scrapy filters out duplicated requests to URLs already visited. This can be configured by the setting DUPEFILTER_CLASS."
Solution 1: https://docs.scrapy.org/en/latest/topics/settings.html#std-setting-DUPEFILTER_CLASS
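If you want to confirm that the default filter really is dropping repeat requests, here is a minimal settings.py sketch (DUPEFILTER_DEBUG is a standard Scrapy setting; the class shown is Scrapy's default):

# settings.py -- minimal sketch, assuming the default filter is in place
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'  # Scrapy's default duplicate filter
DUPEFILTER_DEBUG = True  # log every duplicate request the filter drops

With DUPEFILTER_DEBUG enabled, each filtered request appears in the crawl log, so you can see whether pages are actually being visited twice or the crawl is just large.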
My experience with your code: there are very many links, and I did not see any duplicate URLs being visited twice.
Solution 2, in the worst case: in settings.py, set DEPTH_LIMIT = some number of your choice.
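For example, a minimal settings.py sketch (the value 2 is just an illustrative choice; DEPTH_LIMIT defaults to 0, which means unlimited depth):

# settings.py
DEPTH_LIMIT = 2  # follow links at most 2 hops from start_urls; 0 (the default) means no limit

This caps how far the spider follows links from the start URLs, so even a very large site produces a bounded crawl.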