Can't crawl page with simple Scrapy spider
I'm very new to Scrapy, and I'm trying to crawl a website with a simple spider (built on another spider I found here: http://scraping.pro/web-scraping-python-scrapy-blog-series/).
Why does my spider crawl 0 pages (with no errors)?
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from items import NewsItem

class TutsPlus(CrawlSpider):
    name = "tutsplus"
    allowed_domains = ["net.tutsplus.com"]
    start_urls = [
        "http://code.tutsplus.com/posts?page="
    ]
    rules = [Rule(LinkExtractor(allow=['/posts?page=\d+']), 'parse_story')]

    def parse_story(self, response):
        story = NewsItem()
        story['url'] = response.url
        story['title'] = response.xpath("//li[@class='posts__post']/a/text()").extract()
        return story
Meanwhile, a very similar spider runs fine:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from items import NewsItem

class BbcSpider(CrawlSpider):
    name = "bbcnews"
    allowed_domains = ["bbc.co.uk"]
    start_urls = [
        "http://www.bbc.co.uk/news/technology/",
    ]
    rules = [Rule(LinkExtractor(allow=['/technology-\d+']), 'parse_story')]

    def parse_story(self, response):
        story = NewsItem()
        story['url'] = response.url
        story['headline'] = response.xpath("//title/text()").extract()
        story['intro'] = response.css('story-body__introduction::text').extract()
        return story
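It looks like your regular expression '/posts?page=\\d+' isn't what you really want. An unescaped ? makes the preceding s optional, so the pattern matches URLs like '/postspage=2' and '/postpage=2' rather than '/posts?page=2'.
I think you want something like '/posts\\?page=\\d+', which escapes the ? so that it matches a literal question mark.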
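To see the difference without running a crawl, the two patterns can be tried with Python's re module directly. This is only a sketch: it assumes the link extractor applies each allow pattern as a regex search against the absolute URL, and the three sample URLs are made up for illustration (only the first has the shape the site actually uses).

```python
import re

# The pattern from the question: "?" is a quantifier here, making "s" optional.
unescaped = re.compile(r"/posts?page=\d+")
# The suggested fix: "\?" matches a literal question mark.
escaped = re.compile(r"/posts\?page=\d+")

# Illustrative URLs (assumptions, not taken from a real crawl).
urls = [
    "http://code.tutsplus.com/posts?page=2",
    "http://code.tutsplus.com/postspage=2",
    "http://code.tutsplus.com/postpage=2",
]

def matching(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere."""
    return [url for url in candidates if pattern.search(url)]

unescaped_hits = matching(unescaped, urls)
escaped_hits = matching(escaped, urls)

print("unescaped:", unescaped_hits)
print("escaped:  ", escaped_hits)
```

With the unescaped pattern, the real pagination URL is the one that fails to match, which would be consistent with the spider following no links. Writing the patterns as raw strings (r'...') avoids having to double the backslashes.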