

Scrapy is following and scraping non-allowed links

I have a CrawlSpider set up to follow certain links and scrape a news magazine, where the link to each issue follows this URL scheme:

http://example.com/YYYY/DDDD/index.htm, where YYYY is the year and DDDD is the three- or four-digit issue number.

I only want issues 928 onwards, and have my rules below. I have no problem connecting to the site, crawling links, or extracting items (so I didn't include the rest of my code). The spider nevertheless seems determined to follow non-allowed links: it is trying to scrape issues 377, 398, and more, and it follows the "culture.htm" and "feature.htm" links. This throws a lot of errors and isn't terribly important, but it requires a lot of cleaning of the data. Any suggestions as to what is going wrong?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class crawlerNameSpider(CrawlSpider):
    name = 'crawler'
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/issues.htm"]

    rules = (
        Rule(SgmlLinkExtractor(allow = ('\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm', )), follow = True),
        Rule(SgmlLinkExtractor(allow = ('fr[0-9].htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('eg[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('ec[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('op[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('sc[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('re[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('in[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(deny = ('culture.htm', )), ),
        Rule(SgmlLinkExtractor(deny = ('feature.htm', )), ),
    )
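For what it's worth, checking the first allow pattern with plain re (a rough stand-in for the extractor's unanchored matching; the sample URLs here are invented) suggests the regex itself does reject the old issues, so the stray requests presumably come from one of the other rules:

    import re

    pattern = re.compile(r'\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm')

    # The link extractor applies allow patterns with an unanchored search,
    # so re.search approximates its matching here.
    for url in ('http://example.com/2010/377/index.htm',    # old issue
                'http://example.com/2010/928/index.htm',    # first wanted issue
                'http://example.com/2011/1045/index.htm'):  # four-digit issue
        print(url, bool(pattern.search(url)))
    # -> False, True, True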

EDIT: I fixed this using a much simpler regex for 2009, 2010, and 2011, but I am still curious why the above doesn't work, if anyone has any suggestions. A sketch of the simpler shape follows.
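(For illustration only; the actual pattern isn't shown in the edit, so this is only a guess at its shape:)

    # Hypothetical reconstruction -- a year-based rule like this would only
    # match issue indexes under the 2009-2011 directories.
    Rule(SgmlLinkExtractor(allow = (r'20(09|10|11)/\d+/index\.htm', )), follow = True),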

You need to pass the deny argument to the same SgmlLinkExtractor that collects the links to follow: a Rule whose extractor has only a deny pattern still extracts every other link on the page, and because such a Rule has no callback it follows them all. You also don't need to create so many Rules if they all call the one function parse_item. I would write your code as:

rules = (
        Rule(SgmlLinkExtractor(
                    allow = (r'\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm', ),
                    deny = (r'culture\.htm', r'feature\.htm'),
                    ),
            follow = True,
        ),
        Rule(SgmlLinkExtractor(
                allow = (
                    r'fr[0-9]\.htm',
                    r'eg[0-9]*\.htm',
                    r'ec[0-9]*\.htm',
                    r'op[0-9]*\.htm',
                    r'sc[0-9]*\.htm',
                    r're[0-9]*\.htm',
                    r'in[0-9]*\.htm',
                    )
                ),
                callback = 'parse_item',
        ),
    )
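As a rough sketch of how allow and deny combine (plain re.search standing in for the extractor's unanchored matching; the URLs are made up), a link is followed only when it matches an allow pattern and no deny pattern:

    import re

    allow = re.compile(r'\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm')
    deny = re.compile(r'culture\.htm|feature\.htm')

    def should_follow(url):
        # Followed only if some allow pattern matches and no deny pattern does.
        return bool(allow.search(url)) and not deny.search(url)

    print(should_follow('http://example.com/2010/930/index.htm'))  # True
    print(should_follow('http://example.com/2010/culture.htm'))    # False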

If those are the real URL patterns you use for parse_item, they can be simplified to this:

 Rule(SgmlLinkExtractor(
                allow = (r'(fr|eg|ec|op|sc|re|in)[0-9]*\.htm', ),
                ),
        callback = 'parse_item',
 ),
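A quick check of the combined pattern (plain re again, with invented filenames) shows it accepts the same pages as the seven separate patterns, with one caveat about the unanchored search:

    import re

    combined = re.compile(r'(fr|eg|ec|op|sc|re|in)[0-9]*\.htm')

    for name in ('fr1.htm', 'eg2004.htm', 'index.htm', 'culture.htm'):
        print(name, bool(combined.search(name)))
    # fr1.htm True, eg2004.htm True, index.htm False -- but culture.htm is
    # also True, because the search finds 're.htm' inside it; anchoring the
    # pattern (e.g. prefixing a '/') or a deny entry on this extractor too
    # would rule that out.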
