简体   繁体   中英

Scrapy crawl spider only touch start_urls

I found that my CrawlSpider only crawls start_urls , and not going any further.

The following is my code.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['holy-bible-eng']
    start_urls = ['file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml']

    rules = (
        Rule(LinkExtractor(allow=r'OEBPS'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        return response

Below is my file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml in start_urls

 <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>Holy Bible</title><link href="lds_ePub_scriptures.css" rel="stylesheet" type="text/css" /></head><body class="bible-toc"><div class="titleBlock"><h1 class="toc-title">The Names and Order of All the <br /><span class="dominant">Books of the Old and <br />New Testaments</span></h1></div><div class="bible-toc"><p><a href="bible_dedication.xhtml">Epistle Dedicatory</a> | <a href="quad_abbreviations.xhtml">Abbreviations</a></p><h2 class="toc-title"><a href="ot.xhtml">The Books of the Old Testament</a></h2><p><a href="gen.xhtml">Genesis</a> | <a href="ex.xhtml">Exodus</a> | <a href="lev.xhtml">Leviticus</a> | <a href="num.xhtml">Numbers</a> | <a href="deut.xhtml">Deuteronomy</a> | <a href="josh.xhtml">Joshua</a> | <a href="judg.xhtml">Judges</a> | <a href="ruth.xhtml">Ruth</a> | <a href="1-sam.xhtml">1 Samuel</a> | <a href="2-sam.xhtml">2 Samuel</a> | <a href="1-kgs.xhtml">1 Kings</a> | <a href="2-kgs.xhtml">2 Kings</a> | <a href="1-chr.xhtml">1 Chronicles</a> | <a href="2-chr.xhtml">2 Chronicles</a> | <a href="ezra.xhtml">Ezra</a> | <a href="neh.xhtml">Nehemiah</a> | <a href="esth.xhtml">Esther</a> | <a href="job.xhtml">Job</a> | <a href="ps.xhtml">Psalms</a> | <a href="prov.xhtml">Proverbs</a> | <a href="eccl.xhtml">Ecclesiastes</a> | <a href="song.xhtml">Song of Solomon</a> | <a href="isa.xhtml">Isaiah</a> | <a href="jer.xhtml">Jeremiah</a> | <a href="lam.xhtml">Lamentations</a> | <a href="ezek.xhtml">Ezekiel</a> | <a href="dan.xhtml">Daniel</a> | <a href="hosea.xhtml">Hosea</a> | <a href="joel.xhtml">Joel</a> | <a href="amos.xhtml">Amos</a> | <a href="obad.xhtml">Obadiah</a> | <a href="jonah.xhtml">Jonah</a> | <a href="micah.xhtml">Micah</a> | <a href="nahum.xhtml">Nahum</a> | <a href="hab.xhtml">Habakkuk</a> | <a href="zeph.xhtml">Zephaniah</a> | <a href="hag.xhtml">Haggai</a> | <a href="zech.xhtml">Zechariah</a> | <a href="mal.xhtml">Malachi</a></p><h2 class="toc-title"><a href="nt.xhtml">The Books of the New Testament</a></h2><p><a href="matt.xhtml">Matthew</a> | <a href="mark.xhtml">Mark</a> | <a href="luke.xhtml">Luke</a> | <a href="john.xhtml">John</a> | <a href="acts.xhtml">Acts</a> | <a href="rom.xhtml">Romans</a> | <a href="1-cor.xhtml">1 Corinthians</a> | <a href="2-cor.xhtml">2 Corinthians</a> | <a href="gal.xhtml">Galatians</a> | <a href="eph.xhtml">Ephesians</a> | <a href="philip.xhtml">Philippians</a> | <a href="col.xhtml">Colossians</a> | <a href="1-thes.xhtml">1 Thessalonians</a> | <a href="2-thes.xhtml">2 Thessalonians</a> | <a href="1-tim.xhtml">1 Timothy</a> | <a href="2-tim.xhtml">2 Timothy</a> | <a href="titus.xhtml">Titus</a> | <a href="philem.xhtml">Philemon</a> | <a href="heb.xhtml">Hebrews</a> | <a href="james.xhtml">James</a> | <a href="1-pet.xhtml">1 Peter</a> | <a href="2-pet.xhtml">2 Peter</a> | <a href="1-jn.xhtml">1 John</a> | <a href="2-jn.xhtml">2 John</a> | <a href="3-jn.xhtml">3 John</a> | <a href="jude.xhtml">Jude</a> | <a href="rev.xhtml">Revelation</a></p><h2 class="toc-title"><a href="bible-helps_title-page.xhtml">Appendix</a></h2><p><a href="tg.xhtml">Topical Guide</a> | <a href="bd.xhtml">Bible Dictionary</a> | <a href="bible-chron.xhtml">Bible Chronology</a> | <a href="harmony.xhtml">Harmony of the Gospels</a> | <a href="jst.xhtml">Joseph Smith Translation</a> | <a href="bible-maps.xhtml">Bible Maps</a> | <a href="bible-photos.xhtml">Bible Photographs</a></p></div></body></html> 

And the below is my console output.

(crawl) G:\kjvbible>scrapy crawl example
......
......

2017-04-08 09:24:59 [scrapy.core.engine] INFO: Spider opened
2017-04-08 09:24:59 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-08 09:24:59 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6026
2017-04-08 09:24:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml> (referer: None)
2017-04-08 09:24:59 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-08 09:24:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 237,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 3693,

It doesn't go any deeper.

Any suggestions would be welcome.

from CrawlSpider documentation :

follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None follow defaults to True, otherwise it defaults to False

you cannot have a rule with callback and follow=True at the same time. It will only listen to the callback, and it won't go further.

So the main idea behind CrawlSpider 's rules is that it can find links to follow and links to actually extract.

Now scrapy isn't the best idea to check your "local" files, for that just create a simple script.

Another error is that you are setting the allowed_domains class variable, which specifies which domains it should accept. All the others are rejected, and this only works for links on the internet. Remove that variable if you don't want to reject domains, or if you are not using domains at all (your case).

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM