Converting a relative path to an absolute one

Question

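I'm trying to scrape a forum in order to eventually find posts that contain links. For now, I'm just trying to scrape the user name of each post. But I think there is a problem with the URLs not being static.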

spider.py

from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.item import Item, Field


class TextPostItem(Item):
    title = Field()
    url = Field()
    submitted = Field()


class RedditCrawler(CrawlSpider):
    name = 'post-spider'
    allowed_domains = ['flashback.org']
    start_urls = ['https://www.flashback.org/t2637903']


    def parse(self, response):
        s = Selector(response)
        next_link = s.xpath('//a[@class="smallfont2"]//@href').extract()[0]
        if len(next_link):
            yield self.make_requests_from_url(next_link)
        posts =   Selector(response).xpath('//div[@id="posts"]/div[@class="alignc.p4.post"]')
        for post in posts:
            i = TextPostItem()
            i['title'] = post.xpath('tbody/tr[1]/td/span/text()').extract() [0]
            #i['url'] = post.xpath('div[2]/ul/li[1]/a/@href').extract()[0]
            yield i

This gives the following error:

raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: /t2637903p2

Any ideas?

Answer 1

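The link you extract is relative (/t2637903p2), and a request needs an absolute URL that includes the scheme. You need to "join" response.url with the extracted relative URL using urljoin():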

from urllib.parse import urljoin  # Python 3; on Python 2 use: from urlparse import urljoin

urljoin(response.url, next_link)
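
To see what the join produces, here is a minimal standalone sketch using the two URLs from the question. It uses the Python 3 import, and the variable names page_url and absolute are only illustrative; in the spider, page_url would be response.url.

```python
from urllib.parse import urljoin

# Stand-in for response.url in the spider (the start URL from the question).
page_url = "https://www.flashback.org/t2637903"

# The relative href that triggered "Missing scheme in request url".
next_link = "/t2637903p2"

# urljoin resolves the relative path against the page URL,
# keeping the scheme and host of the page.
absolute = urljoin(page_url, next_link)
print(absolute)  # https://www.flashback.org/t2637903p2
```

The resulting absolute URL is what should be passed on to the next request instead of the raw next_link.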

Also note that there is no need to instantiate a Selector object; you can use the response.xpath() shortcut directly:

def parse(self, response):
    next_link = response.xpath('//a[@class="smallfont2"]//@href').extract()[0]
    # ...
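
One thing to watch with the extract()[0] pattern: if a page has no matching link (for instance, possibly the last page of a thread), extract() returns an empty list and indexing it raises IndexError. A rough sketch of a guard, written as a plain function so it runs without Scrapy; the helper name next_page_url is hypothetical and not part of any library:

```python
from urllib.parse import urljoin


def next_page_url(page_url, hrefs):
    """Return the absolute URL for the first extracted href, or None.

    page_url: the URL of the current page (response.url in a spider).
    hrefs: the list returned by .extract(), possibly empty.
    """
    if not hrefs:
        # No "next" link on this page, so there is nothing to follow.
        return None
    return urljoin(page_url, hrefs[0])


# With a link present, the relative href is resolved against the page URL.
print(next_page_url("https://www.flashback.org/t2637903", ["/t2637903p2"]))

# With no link, the helper returns None instead of raising IndexError.
print(next_page_url("https://www.flashback.org/t2637903", []))
```

Inside parse(), the result could be checked for None before yielding a new request.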

Answer 1 (accepted), 2015-10-23 00:31:32