Problem following links with Scrapy

Question

I'm trying to get my web crawler to follow links extracted from a web page. I'm using Scrapy. I can successfully pull data with my crawler, but I can't get it to crawl. I believe the problem is in my rules section. I'm new to Scrapy. Thanks in advance for your help.

I'm scraping this website:

http://ballotpedia.org/wiki/index.php/Category:2012_challenger

The links I'm trying to follow look like this in the source code:

/wiki/index.php/A._Ghani

or

/wiki/index.php/A._Keith_Carreiro

Here is the code for my spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider,Rule

from ballot1.items import Ballot1Item

class Ballot1Spider(CrawlSpider):
   name = "stewie"
   allowed_domains = ["ballotpedia.org"]
   start_urls = [
       "http://ballotpedia.org/wiki/index.php/Category:2012_challenger"
   ]
   rules =  (
       Rule (SgmlLinkExtractor(allow=r'w+'), follow=True),
       Rule(SgmlLinkExtractor(allow=r'\w{4}/\w+/\w+'), callback='parse')
   )

   def parse(self, response):
       hxs = HtmlXPathSelector(response)
       sites = hxs.select('*')
       items = []
       for site in sites:
           item = Ballot1Item()
           item['candidate'] = site.select('/html/head/title/text()').extract()
           item['position'] = site.select('//table[@class="infobox"]/tr/td/b/text()').extract()
           item['controversies'] = site.select('//h3/span[@id="Controversies"]/text()').extract()
           item['endorsements'] = site.select('//h3/span[@id="Endorsements"]/text()').extract()
           item['currentposition'] = site.select('//table[@class="infobox"]/tr/td[@style="text-align:center; background-color:red;color:white; font-size:100%; font-weight:bold;"]/text()').extract()
           items.append(item)
       return items

Answer 1

The links that you're after are only present in this element:

<div lang="en" dir="ltr" class="mw-content-ltr">

So you have to restrict the XPath to keep out extraneous links:

restrict_xpaths='//div[@id="mw-pages"]/div'

Finally, you only want to follow links that look like /wiki/index.php?title=Category:2012_challenger&pagefrom=Alison+McCoy#mw-pages, so your final rules should look like:

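rules = (
    # Follow the category's pagination links.
    Rule(
        SgmlLinkExtractor(
            allow=r'&pagefrom='
        ),
        follow=True
    ),
    # Hand the pages linked from the listing to the callback.
    # callback is an argument of Rule, not of SgmlLinkExtractor.
    # CrawlSpider uses parse() itself, so a differently named
    # callback is safer (see Answer 2).
    Rule(
        SgmlLinkExtractor(
            restrict_xpaths='//div[@id="mw-pages"]/div'
        ),
        callback='parse'
    )
)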
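For what it's worth, here is a rough stdlib-only sketch of what those two rules are meant to select. It doesn't use Scrapy at all: xml.etree.ElementTree stands in for the link extractor, and the HTML fragment is a made-up, simplified version of the category page (the sidebar div, the Main_Page link and the link texts are my assumptions, not taken from the real site).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment of the category page. It is written as
# well-formed XML so ElementTree can parse it; the real page is messier.
PAGE = """
<body>
  <div id="sidebar">
    <a href="/wiki/index.php/Main_Page">Main page</a>
  </div>
  <div id="mw-pages">
    <a href="/wiki/index.php?title=Category:2012_challenger&amp;pagefrom=Alison+McCoy#mw-pages">next 200</a>
    <div lang="en" dir="ltr" class="mw-content-ltr">
      <a href="/wiki/index.php/A._Ghani">A. Ghani</a>
      <a href="/wiki/index.php/A._Keith_Carreiro">A. Keith Carreiro</a>
    </div>
  </div>
</body>
"""

root = ET.fromstring(PAGE.strip())

# Analogue of the first rule: any link whose URL contains "&pagefrom=".
pagination_links = [
    a.get("href")
    for a in root.iter("a")
    if re.search(r"&pagefrom=", a.get("href", ""))
]

# Analogue of the second rule: only links under //div[@id="mw-pages"]/div.
candidate_links = [
    a.get("href")
    for div in root.findall(".//div[@id='mw-pages']/div")
    for a in div.iter("a")
]

print(pagination_links)
print(candidate_links)
```

The sidebar link never shows up in candidate_links, which is the point of restricting the XPath: only the listing inside mw-pages gets handed to the callback.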

Answer 2

You're using a CrawlSpider with a callback named parse, which the Scrapy documentation expressly warns will prevent crawling.

Rename it to something like parse_items and you should be fine.
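To see why the name matters, here is a toy model in plain Python. It is not Scrapy's actual implementation. It only imitates the idea that the base class's own parse method drives the link following, so a subclass that defines its own parse silently replaces that logic. All class names and the tiny site graph are hypothetical.

```python
class ToyCrawlSpider:
    """Toy stand-in for a rule-driven spider. Not Scrapy's real code."""

    # Hypothetical site graph: page -> pages it links to.
    site = {
        "/category": ["/wiki/A", "/wiki/B"],
        "/wiki/A": [],
        "/wiki/B": [],
    }
    callback = "parse_item"

    def parse(self, page):
        """Base-class logic: run the callback on each link, then follow it."""
        results = []
        handler = getattr(self, self.callback)
        for link in self.site[page]:
            results.append(handler(link))
            results.extend(self.parse(link))
        return results


class GoodSpider(ToyCrawlSpider):
    callback = "parse_item"

    def parse_item(self, page):
        return "item from " + page


class BrokenSpider(ToyCrawlSpider):
    callback = "parse"

    def parse(self, page):
        # Shadows the base-class parse, so no link is ever followed.
        return ["item from " + page]


print(GoodSpider().parse("/category"))
print(BrokenSpider().parse("/category"))
```

GoodSpider reaches both linked pages, while BrokenSpider only ever produces an item for the start page. That matches the symptom in the question: data comes back, but nothing gets crawled.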

Answer 3

r'w+' is wrong (I think you meant r'\w+'), and r'\w{4}/\w+/\w+' doesn't look right either, as it doesn't match your links (it's missing a leading /). Why don't you just try r'/wiki/index.php/.+'? Don't forget that \w doesn't include . and other symbols that can be part of an article name.
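A quick way to check patterns like these is to run them against the sample links from the question with the re module. This sketch assumes the link extractor applies its allow patterns with search-style matching against the URL (that is my understanding of Scrapy's behavior, so verify it against your version). The dictionary labels and the helper function are just names chosen for this example.

```python
import re

# The two sample links given in the question.
links = [
    "/wiki/index.php/A._Ghani",
    "/wiki/index.php/A._Keith_Carreiro",
]

patterns = {
    "rule 1 as posted": r"w+",
    "rule 2 as posted": r"\w{4}/\w+/\w+",
    "suggested": r"/wiki/index.php/.+",
}


def matching_links(pattern, urls):
    """Return the URLs in which the pattern is found anywhere."""
    return [u for u in urls if re.search(pattern, u)]


for label, pattern in patterns.items():
    print(label, matching_links(pattern, links))
```

Under that assumption, r'w+' matches anything containing a literal w, so it is far too broad. The second pattern matches neither link, because \w cannot match the dot in index.php. The suggested pattern matches both.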


Answer 1 1 2013-02-12 00:55:05
Answer 2 1 2013-02-12 12:00:11
Answer 3 0 2013-02-12 00:49:36
