Scrapy won't crawl

Question

    # -*- coding: utf-8 -*-
    import scrapy

    class ProvasSpider(scrapy.Spider):
        name = 'provas'
        allowed_domains = ['folhadirigida.com.br']
        start_urls = ['https://folhadirigida.com.br/']

        def parse(self, response):  # create the page file
            page = response.url.split("/")[-3]
            filename = '%s.html' % page
            with open(filename, 'wb') as f:
                f.write(response.body)

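When I run this spider against this page, I get this image. If I run the same spider against this other page, for example, I get an exact copy of the page. Why doesn't it work for the first page?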

Answer 1

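CSS files referenced by relative paths do not load. Because the saved file is opened from disk, the browser resolves relative paths against the local file location instead of the original site, so every link in the scraped page must use an absolute path.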
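
One way to apply this is to rewrite the links before saving the file. The snippet below is a minimal sketch, not part of the original answer: `absolutize_links` is a hypothetical helper name, and it uses a regular expression rather than a real HTML parser, so it only handles quoted `href` and `src` attributes.

```python
import re
from urllib.parse import urljoin

# Matches quoted href="..." / src="..." attributes (single or double quotes).
_ATTR_RE = re.compile(
    r'''(?P<attr>\b(?:href|src))=(?P<quote>["'])(?P<url>.*?)(?P=quote)''',
    re.IGNORECASE,
)


def absolutize_links(html: str, base_url: str) -> str:
    """Rewrite href/src values so they resolve against base_url."""

    def _replace(match):
        # urljoin leaves already-absolute URLs unchanged.
        absolute = urljoin(base_url, match.group("url"))
        quote = match.group("quote")
        return f'{match.group("attr")}={quote}{absolute}{quote}'

    return _ATTR_RE.sub(_replace, html)
```

Inside `parse`, this could be called as `absolutize_links(response.text, response.url)` and the result written to the file in text mode.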

Answer 2

The page renders poorly because your saved HTML lacks <base href="https://folhadirigida.com.br/">.

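    import scrapy

    class ProvasSpider(scrapy.Spider):
        name = 'provas'
        allowed_domains = ['folhadirigida.com.br']
        start_urls = ['https://folhadirigida.com.br/']

        def parse(self, response):  # create the page file
            # For the start URL this yields 'folhadirigida.com.br'
            page = response.url.split("/")[-2]
            filename = '%s.html' % page
            with open(filename, 'wb') as f:
                # Splice a <base> tag in at byte offset 42 so that relative
                # URLs resolve against the live site. The offset is
                # hard-coded for this particular page's markup.
                ref_body = (response.body[:42]
                            + b'<base href="https://folhadirigida.com.br/">'
                            + response.body[42:])
                f.write(ref_body)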

Inserting it into the saved HTML, as this code does, makes the page render correctly.
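
The fixed byte offset only suits this one page. As a hedged alternative, the sketch below finds the opening `<head>` tag and inserts the `<base>` element right after it. `insert_base_tag` is a hypothetical helper name, not something Scrapy provides, and the regex assumes a conventional `<head ...>` tag is present.

```python
import re

# Opening <head> tag, with optional attributes; does not match <header>.
_HEAD_RE = re.compile(rb"<head\b[^>]*>", re.IGNORECASE)


def insert_base_tag(body: bytes, base_url: str) -> bytes:
    """Return body with <base href=base_url> placed right after <head>."""
    match = _HEAD_RE.search(body)
    if match is None:
        # No <head> found: leave the document untouched.
        return body
    base_tag = b'<base href="' + base_url.encode("utf-8") + b'">'
    return body[:match.end()] + base_tag + body[match.end():]
```

In the spider, `f.write(insert_base_tag(response.body, response.url))` would replace the slicing at offset 42.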
