
Scrapy keeps crawling and never stops… CrawlSpider rules

I'm very new to Python and Scrapy and decided to try to build a spider instead of just being scared of the new, challenging-looking language.

So this is my first spider, and its purpose is:

  • It runs through a website's pages (via links it finds on every page)
  • Lists all the links (a > href) that exist on every page
  • Writes down in each row: the page where the links were found, the links themselves (URL-decoded, so encoded/non-English characters are readable), the number of links on the page, and the HTTP response code of every link.

The problem I'm encountering is that it never stops crawling; it seems stuck in a loop, re-crawling every page more than once...

What did I do wrong? (Obviously many things, since I've never written Python code before, but still.) How can I make the spider crawl every page only once?

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import urllib.parse
import requests
import threading


class TestSpider(CrawlSpider):
    name = "test"
    allowed_domains = ["cerve.co"]
    start_urls = ["https://cerve.co"]
    rules = [Rule(LinkExtractor(allow=['.*'], tags='a', attrs='href'), callback='parse_item', follow=True)]

    def parse_item(self, response):
        alllinks = response.css('a::attr(href)').getall()
        for link in alllinks:
            link = response.urljoin(link)
            yield {
                'page': urllib.parse.unquote(response.url),
                'links': urllib.parse.unquote(link),
                'number of links': len(alllinks),
                'status': requests.get(link).status_code
            }

The Scrapy docs say: by default, Scrapy filters out duplicate requests to URLs it has already visited. This behaviour can be configured via the DUPEFILTER_CLASS setting.

Solution 1: https://docs.scrapy.org/en/latest/topics/settings.html#std-setting-DUPEFILTER_CLASS
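For reference, this setting lives in the project's settings.py. A minimal sketch (the RFPDupeFilter value below is Scrapy's documented default, so stating it explicitly only makes the behaviour visible; DUPEFILTER_DEBUG is a real Scrapy setting that logs every dropped duplicate, handy for checking whether pages are truly being re-crawled):

# settings.py of the Scrapy project

# Scrapy's default duplicate filter: requests whose fingerprint has already
# been seen in this crawl are dropped, so each URL is normally fetched once.
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

# Log every request the filter drops instead of only the first one,
# to verify whether the spider is really visiting pages twice.
DUPEFILTER_DEBUG = True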

My experience with your code: there are simply a lot of links, and I did not see any duplicate URLs being visited twice.

Solution 2, for the worst case:
In settings.py, set DEPTH_LIMIT to some number of your choice, as shown below.
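A minimal sketch of that worst-case safety net; the value 2 here is an arbitrary example, not a recommendation:

# settings.py of the Scrapy project

# Do not follow links more than 2 hops away from the start URLs.
# The default, DEPTH_LIMIT = 0, means no limit, so on a large,
# heavily cross-linked site the crawl can appear to run forever.
DEPTH_LIMIT = 2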
