Scraping an HTML file saved on the local system

Question

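For example, I have a site, "www.example.com", and I want to scrape its HTML after saving it to my local system. For testing, I saved the page on my desktop as example.html.

I have written the following spider code for it: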

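# Legacy Scrapy (0.x) imports that this Python 2 code relies on
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class ExampleSpider(BaseSpider):
   name = "example"
   start_urls = ["example.html"]  # bare file name, no URL scheme

   def parse(self, response):
       print response
       hxs = HtmlXPathSelector(response)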

But when I run the code above, I get the following error:

ValueError: Missing scheme in request url: example.html

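In the end, my goal is to scrape the example.html file, which contains the HTML of www.example.com saved on my local system.

Can anyone suggest how to reference that example.html file in start_urls?

Thanks in advance.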

Answer 1

You can scrape a local file using a URL of the following form:

 file:///path/to/file.html
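
The error in the question comes from the bare file name having no URL scheme. As a minimal sketch (standard library only, with example.html taken from the question and the current working directory as an assumed location), one way to check for the scheme and build a file:// URL from a relative path is:

```python
from pathlib import Path
from urllib.parse import urlparse

# A bare file name has no scheme, which is what the ValueError complains about.
bare = "example.html"
bare_scheme = urlparse(bare).scheme  # empty string

# Resolve the name against the current working directory (an assumption for
# this sketch) and convert the absolute path to a file:// URL.
file_url = Path(bare).resolve().as_uri()
file_scheme = urlparse(file_url).scheme

print(repr(bare_scheme), file_url)
```

The resulting string can then be placed in start_urls in place of the bare name.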

Answer 2

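You can use HttpCacheMiddleware, which lets you run the spider from a cache. The HttpCacheMiddleware settings are covered in the Scrapy documentation.

Basically, adding the following settings to settings.py makes it work: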

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0 # Set to 0 to never expire

However, this requires an initial spider run against the network to populate the cache.
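
Once the cache has been populated, a slightly fuller settings.py fragment might look like the sketch below. HTTPCACHE_DIR and HTTPCACHE_IGNORE_MISSING are additions that the answer does not mention; the values are assumptions chosen for illustration, so check them against the Scrapy version you run.

```python
# settings.py (sketch): serve responses from the HTTP cache
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0    # 0 means cached responses never expire
HTTPCACHE_DIR = "httpcache"      # cache folder name (Scrapy's default value)
HTTPCACHE_IGNORE_MISSING = True  # skip uncached requests instead of downloading them
```

With HTTPCACHE_IGNORE_MISSING enabled, requests that are not already cached are ignored rather than fetched, so it should only be switched on after the initial networked run.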

Answer 3

In Scrapy, you can scrape a local file like this:

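# Legacy Scrapy (0.x) imports that this Python 2 code relies on
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class ExampleSpider(BaseSpider):
   name = "example"
   # The file:// prefix supplies the scheme that the bare file name lacked
   start_urls = ["file:///path_of_directory/example.html"]

   def parse(self, response):
       print response
       hxs = HtmlXPathSelector(response)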

I suggest checking it with scrapy shell 'file:///path_of_directory/example.html'.

Answer 4

Just to share the way I like to scrape local files:

import scrapy
import os

LOCAL_FILENAME = 'example.html'
LOCAL_FOLDER = 'html_files'
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = [
        f"file://{BASE_DIR}/{LOCAL_FOLDER}/{LOCAL_FILENAME}"
    ]

I am using f-strings (Python 3.6+, https://www.python.org/dev/peps/pep-0498/), but you can switch to %-formatting or str.format() if you prefer.
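
For comparison, here is a small sketch of the three formatting styles producing the same URL. The BASE_DIR value is a hypothetical POSIX path used only for illustration; in the spider above it is computed from __file__.

```python
LOCAL_FILENAME = "example.html"
LOCAL_FOLDER = "html_files"
BASE_DIR = "/home/user/project"  # hypothetical absolute path for this sketch

# f-string (Python 3.6+), as used in the spider above
url_fstring = f"file://{BASE_DIR}/{LOCAL_FOLDER}/{LOCAL_FILENAME}"

# %-formatting
url_percent = "file://%s/%s/%s" % (BASE_DIR, LOCAL_FOLDER, LOCAL_FILENAME)

# str.format()
url_format = "file://{}/{}/{}".format(BASE_DIR, LOCAL_FOLDER, LOCAL_FILENAME)

print(url_fstring)
```

Because BASE_DIR starts with a slash on POSIX systems, the two slashes after file: combine with it into the usual file:/// prefix.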

Answer 5

scrapy shell "file:E:\folder\to\your\script\Scrapy\teste1\teste1.html"

This worked for me today on Windows 10. I had to give the full path without the ////.
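
If you would rather pass a standard file URI, the sketch below shows how Python's pathlib converts the same Windows path. It only demonstrates the conversion; whether a given Scrapy setup behaves identically with both forms is not something this sketch verifies.

```python
from pathlib import PureWindowsPath

# The path from the command above; PureWindowsPath works on any OS because
# it only manipulates the string and never touches the file system.
win_path = PureWindowsPath(r"E:\folder\to\your\script\Scrapy\teste1\teste1.html")

# as_uri() swaps backslashes for forward slashes and adds the file:/// prefix.
win_url = win_path.as_uri()
print(win_url)
```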

Answer 6

You can simply do:

from scrapy import Request

def start_requests(self):  # method of your spider class
    yield Request(url='file:///path_of_directory/example.html')

Answer 7

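If you look at the source code of Scrapy's Request, for example on GitHub, you can see that Scrapy sends a request to an HTTP server and receives the requested page in response. Your file system is not an HTTP server. To test with Scrapy, you have to set up an HTTP server. Then you can give Scrapy a URL such as: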

http://127.0.0.1/example.html
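
As a rough sketch of this approach, Python's built-in http.server can serve a folder on 127.0.0.1. The temporary folder, the sample page content, and the serve_directory helper below are all made up for illustration; port 0 asks the OS for any free port.

```python
import functools
import http.server
import tempfile
import threading
import urllib.request
from pathlib import Path


def serve_directory(directory, port=0):
    """Serve `directory` over HTTP on 127.0.0.1 in a background thread."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the saved page; the content is a placeholder.
    Path(tmp, "example.html").write_text(
        "<html><title>Example</title></html>", encoding="utf-8"
    )
    server = serve_directory(tmp)
    url = f"http://127.0.0.1:{server.server_port}/example.html"

    # Fetch the page directly, bypassing any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url) as resp:
        body = resp.read().decode("utf-8")

    server.shutdown()
    server.server_close()

print(url)
```

The printed URL is the kind of address that could go into start_urls while the server is running. From a terminal, python -m http.server run inside the folder does the same job.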

7 answers

Answer 1: score 30, 2014-03-05 19:56:23
Answer 2: score 13, 2012-06-05 12:27:23
Answer 3: score 5, 2018-11-20 10:39:50
Answer 4: score 2, 2020-05-22 19:08:23
Answer 5: score 1, 2019-05-01 22:41:13
Answer 6: score 0, 2021-03-11 05:51:56
Answer 7: score -6, 2012-06-05 12:04:49