
Scrapy LinkExtractor crawling links that use parent directory

Using a basic CrawlSpider in Scrapy, I am trying to crawl pages. The relevant links on the pages I want to crawl all begin with the parent-directory reference '..' instead of the full domain.

For example, if I start from the page https://www.mytarget.com/posts/4/friendly-url and I want to crawl every post under /posts, the relevant links on that page would be:

'../55/post-name'
'../563/another-name'

instead of:

'posts/55/post-name'
'posts/563/another-name'

or, better still:

'https://www.mytarget.com/posts/55/post-name'
'https://www.mytarget.com/posts/563/another-name'

Removing mytarget.com from allowed_domains does not seem to help. The crawler does not find any new links on the site that match the '..' parent-directory references.
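
For reference, relative links like these resolve against the URL of the page they appear on, so from the start page they still point back into /posts/. A quick sketch with Python's urllib.parse.urljoin (using the example URLs above) shows the resolution:

from urllib.parse import urljoin

# Relative hrefs are resolved against the URL of the page they were found on
page = 'https://www.mytarget.com/posts/4/friendly-url'
for href in ['../55/post-name', '../563/another-name']:
    print(urljoin(page, href))

# https://www.mytarget.com/posts/55/post-name
# https://www.mytarget.com/posts/563/another-name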

Here is my code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from exercise_data_collector.items import Post

class MyCrawlerSpider(CrawlSpider):
    name = 'my_crawler'
    allowed_domains = ['mytarget.com']
    start_urls = ['https://www.mytarget.com/posts/4/friendly-url']

    rules = (
        Rule(LinkExtractor(allow=r'posts/[0-9]+/[0-9A-Za-z-_]+'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=r'/posts\/[0-9]+\/[0-9A-Za-z-_]+'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=r'/..\/[0-9]+\/[0-9A-Za-z-_]+'), callback='parse_item', follow=True),
    )

    def parse(self, response):
        links = self.le1.extract_links(response)

        item = Post()
        item["page_title"] = response.xpath('//title/text()').get()
        item["name"] = response.xpath("//div[@class='container']/div[@class='row']/div[1]/div[1]/text()[2]").get().replace('->','').strip()
        item['difficulty'] = response.xpath("//p[strong[contains(text(), 'Difficulty')]]/text()").get().strip()

        return item

I'm not sure how to configure the regular expressions to pick up these relative links, or even whether my regexes work outside regexr.com.

How can I crawl pages like this?

I solved this with the regex r'posts/[0-9]+/[A-Za-z-_]+':

class MyCrawlerSpider(CrawlSpider):
    name = 'my_crawler'
    allowed_domains = ['mytarget.com']
    start_urls = ['https://www.mytarget.com/posts/4/friendly-url']

    rules = (
        Rule(LinkExtractor(allow=r'exercises/[0-9]+/[A-Za-z-_]+'), callback='parse_item', follow=True),
    )

    def parse(self, response):
        links = self.le1.extract_links(response)

        item = Post()
        item["page_title"] = response.xpath('//title/text()').get()
        item["name"] = response.xpath("//div[@class='container']/div[@class='row']/div[1]/div[1]/text()[2]").get().replace('->','').strip()
        item['difficulty'] = response.xpath("//p[strong[contains(text(), 'Difficulty')]]/text()").get().strip()

        return item
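
As a quick sanity check outside a full crawl, the allow pattern can be tested against a standalone response; LinkExtractor resolves relative hrefs to absolute URLs before applying the pattern. The HTML snippet and expected output below are made up to mirror the example links above:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# A fake post page whose links use the '..' parent-directory form
html = b'<a href="../55/post-name">a</a> <a href="../563/another-name">b</a>'
response = HtmlResponse(url='https://www.mytarget.com/posts/4/friendly-url',
                        body=html, encoding='utf-8')

# The allow pattern is matched against the resolved absolute URLs
le = LinkExtractor(allow=r'posts/[0-9]+/[A-Za-z-_]+')
for link in le.extract_links(response):
    print(link.url)

# https://www.mytarget.com/posts/55/post-name
# https://www.mytarget.com/posts/563/another-name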

I did run into a recursion problem where posts/12/page.html turned into posts/12/12/page.html ... posts/12/12/12/12/12/12/page.html. I think that may be a bug on their site.
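
If the site keeps emitting those self-referencing relative links, one possible guard (a sketch, not part of the original solution) is a deny pattern that rejects URLs with a repeated numeric segment; Scrapy's DEPTH_LIMIT setting can also cap how far such a loop runs:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# Reject URLs like posts/12/12/page.html, where the numeric id segment repeats
rule = Rule(
    LinkExtractor(
        allow=r'posts/[0-9]+/[A-Za-z-_]+',
        deny=r'posts/[0-9]+/[0-9]+/',
    ),
    callback='parse_item',
    follow=True,
)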

