Scrapy CrawlSpider only crawls start_urls
I found that my CrawlSpider only crawls the start_urls and does not go any further.
The following is my code.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['holy-bible-eng']
    start_urls = ['file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml']

    rules = (
        Rule(LinkExtractor(allow=r'OEBPS'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        return response
Below is the file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml file from start_urls:
<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>Holy Bible</title><link href="lds_ePub_scriptures.css" rel="stylesheet" type="text/css" /></head><body class="bible-toc"><div class="titleBlock"><h1 class="toc-title">The Names and Order of All the <br /><span class="dominant">Books of the Old and <br />New Testaments</span></h1></div><div class="bible-toc"><p><a href="bible_dedication.xhtml">Epistle Dedicatory</a> | <a href="quad_abbreviations.xhtml">Abbreviations</a></p><h2 class="toc-title"><a href="ot.xhtml">The Books of the Old Testament</a></h2><p><a href="gen.xhtml">Genesis</a> | <a href="ex.xhtml">Exodus</a> | <a href="lev.xhtml">Leviticus</a> | <a href="num.xhtml">Numbers</a> | <a href="deut.xhtml">Deuteronomy</a> | <a href="josh.xhtml">Joshua</a> | <a href="judg.xhtml">Judges</a> | <a href="ruth.xhtml">Ruth</a> | <a href="1-sam.xhtml">1 Samuel</a> | <a href="2-sam.xhtml">2 Samuel</a> | <a href="1-kgs.xhtml">1 Kings</a> | <a href="2-kgs.xhtml">2 Kings</a> | <a href="1-chr.xhtml">1 Chronicles</a> | <a href="2-chr.xhtml">2 Chronicles</a> | <a href="ezra.xhtml">Ezra</a> | <a href="neh.xhtml">Nehemiah</a> | <a href="esth.xhtml">Esther</a> | <a href="job.xhtml">Job</a> | <a href="ps.xhtml">Psalms</a> | <a href="prov.xhtml">Proverbs</a> | <a href="eccl.xhtml">Ecclesiastes</a> | <a href="song.xhtml">Song of Solomon</a> | <a href="isa.xhtml">Isaiah</a> | <a href="jer.xhtml">Jeremiah</a> | <a href="lam.xhtml">Lamentations</a> | <a href="ezek.xhtml">Ezekiel</a> | <a href="dan.xhtml">Daniel</a> | <a href="hosea.xhtml">Hosea</a> | <a href="joel.xhtml">Joel</a> | <a href="amos.xhtml">Amos</a> | <a href="obad.xhtml">Obadiah</a> | <a href="jonah.xhtml">Jonah</a> | <a href="micah.xhtml">Micah</a> | <a href="nahum.xhtml">Nahum</a> | <a 
href="hab.xhtml">Habakkuk</a> | <a href="zeph.xhtml">Zephaniah</a> | <a href="hag.xhtml">Haggai</a> | <a href="zech.xhtml">Zechariah</a> | <a href="mal.xhtml">Malachi</a></p><h2 class="toc-title"><a href="nt.xhtml">The Books of the New Testament</a></h2><p><a href="matt.xhtml">Matthew</a> | <a href="mark.xhtml">Mark</a> | <a href="luke.xhtml">Luke</a> | <a href="john.xhtml">John</a> | <a href="acts.xhtml">Acts</a> | <a href="rom.xhtml">Romans</a> | <a href="1-cor.xhtml">1 Corinthians</a> | <a href="2-cor.xhtml">2 Corinthians</a> | <a href="gal.xhtml">Galatians</a> | <a href="eph.xhtml">Ephesians</a> | <a href="philip.xhtml">Philippians</a> | <a href="col.xhtml">Colossians</a> | <a href="1-thes.xhtml">1 Thessalonians</a> | <a href="2-thes.xhtml">2 Thessalonians</a> | <a href="1-tim.xhtml">1 Timothy</a> | <a href="2-tim.xhtml">2 Timothy</a> | <a href="titus.xhtml">Titus</a> | <a href="philem.xhtml">Philemon</a> | <a href="heb.xhtml">Hebrews</a> | <a href="james.xhtml">James</a> | <a href="1-pet.xhtml">1 Peter</a> | <a href="2-pet.xhtml">2 Peter</a> | <a href="1-jn.xhtml">1 John</a> | <a href="2-jn.xhtml">2 John</a> | <a href="3-jn.xhtml">3 John</a> | <a href="jude.xhtml">Jude</a> | <a href="rev.xhtml">Revelation</a></p><h2 class="toc-title"><a href="bible-helps_title-page.xhtml">Appendix</a></h2><p><a href="tg.xhtml">Topical Guide</a> | <a href="bd.xhtml">Bible Dictionary</a> | <a href="bible-chron.xhtml">Bible Chronology</a> | <a href="harmony.xhtml">Harmony of the Gospels</a> | <a href="jst.xhtml">Joseph Smith Translation</a> | <a href="bible-maps.xhtml">Bible Maps</a> | <a href="bible-photos.xhtml">Bible Photographs</a></p></div></body></html>
And below is my console output.
(crawl) G:\kjvbible>scrapy crawl example
......
......
2017-04-08 09:24:59 [scrapy.core.engine] INFO: Spider opened
2017-04-08 09:24:59 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-08 09:24:59 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6026
2017-04-08 09:24:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml> (referer: None)
2017-04-08 09:24:59 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-08 09:24:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 237,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 3693,
It doesn't go any deeper. Any suggestions would be welcome.
From the CrawlSpider documentation:

follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.
You cannot have a rule with a callback and follow=True at the same time. It will only listen to the callback, and it won't go any further.
So the main idea behind CrawlSpider's rules is that it can find both links to follow and links to actually extract.
Also, scrapy isn't the best tool for checking your "local" files; for that, just create a simple script.
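For example, a minimal standard-library sketch that pulls the href targets out of XHTML markup without Scrapy at all (the LinkCollector class and the sample markup here are illustrative, not part of any library):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it encounters."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

collector = LinkCollector()
# In practice you would feed it the contents of your local file, e.g.
# open(r'G:/holy-bible-eng/OEBPS/bible-toc.xhtml', encoding='utf-8').read()
collector.feed('<p><a href="gen.xhtml">Genesis</a> | <a href="ex.xhtml">Exodus</a></p>')
print(collector.links)  # ['gen.xhtml', 'ex.xhtml']
```

From there you can open each linked file in a plain loop, with no crawler machinery in the way.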
Another error is that you are setting the allowed_domains class variable, which specifies which domains the spider should accept. All others are rejected, and this only works for links on the internet. Remove that variable if you don't want to reject any domains, or if you are not using domains at all (your case).
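Putting the two points together, a sketch of a corrected spider: allowed_domains is removed (file:// URLs have no domain, so the filter would drop every extracted link), and the rules are split into a follow-only rule and an extraction rule. The allow patterns are guesses based on the file names in the question, and this sketch is untested against your local files:

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ExampleSpider(CrawlSpider):
    name = 'example'
    # allowed_domains removed: file:// URLs have no domain, so the
    # domain filter would reject every link the extractor finds.
    start_urls = ['file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml']

    rules = (
        # Follow-only rule: no callback, so follow defaults to True;
        # keeps walking table-of-contents pages.
        Rule(LinkExtractor(allow=r'toc')),
        # Extraction rule: calls parse_item on the pages you care about.
        Rule(LinkExtractor(allow=r'OEBPS'), callback='parse_item'),
    )

    def parse_item(self, response):
        # Yield something serializable instead of the raw Response object.
        yield {'url': response.url, 'title': response.css('title::text').get()}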