
Downloading a webpage and associated resources to a WARC in Python

I'm interested in downloading a bunch of webpages for later analysis. There are two things that I would like to do:

  • Download the page and associated resources (images, multiple pages associated with an article, etc.) to a WARC file.
  • Change all links to point to the now-local files.

I would like to do this in Python.

Are there any good libraries for doing this? Scrapy seems designed to scrape websites rather than single pages, and I'm not sure how to generate WARC files with it. Calling out to wget is a doable solution if there isn't something more Python-native. Heritrix is complete overkill, and not much of a Python solution. wpull would be ideal if it had a well-documented Python library, but it seems to be mostly an application instead.

Any other ideas?

Just use wget; it is the simplest and most stable tool you can use to crawl the web and save the result into a WARC.

See man wget, or just to start:

--warc-file=FILENAME        save request/response data to a .warc.gz file
-p,  --page-requisites           get all images, etc. needed to display HTML page
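
If you still want to drive this from Python, calling wget through subprocess is one option. The following is a minimal sketch, not a definitive implementation: it assumes wget is installed and on the PATH, and the helper names build_wget_command and archive_page are made up for illustration. It uses only the two options listed above.

```python
import subprocess


def build_wget_command(url, warc_basename):
    """Return the argument list for a wget run that records a WARC.

    Only the two options quoted above are used: --page-requisites (-p)
    and --warc-file. The second takes a base name, and the help text
    above says the data is saved to a .warc.gz file.
    """
    return [
        "wget",
        "--page-requisites",
        f"--warc-file={warc_basename}",
        url,
    ]


def archive_page(url, warc_basename):
    """Run wget and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_wget_command(url, warc_basename), check=True)
```

Calling archive_page("http://example.com/", "example") would then be expected to leave an example.warc.gz in the current directory, next to the files wget downloads as usual.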

Please note that you don't have to change any links; the WARC preserves the original web pages. It is the job of replay software (OpenWayback, pywb) to make the WARC content browsable again.

If you need to go with Python, internetarchive/warc is the default library.

If you want to craft a WARC file manually, take a look at ampoffcom/htmlwarc.
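
To give a rough idea of what crafting a record by hand involves, here is a standard-library-only sketch. It does not use the API of either project mentioned above. It writes a WARC/1.0 "resource" record (the payload alone, without the HTTP exchange) and appends it to a gzip file as a separate gzip member. The function names build_warc_record and append_record are hypothetical, and a real archive would normally also carry a warcinfo record and request/response records.

```python
import gzip
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri, payload, content_type="text/html"):
    """Serialize one WARC/1.0 'resource' record to bytes.

    Layout: version line, named header fields, a blank line, the
    payload block, then two CRLFs that terminate the record.
    """
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {warc_date}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    header = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    return header + payload + b"\r\n\r\n"


def append_record(warc_path, record):
    """Append the record to warc_path as its own gzip member."""
    with gzip.open(warc_path, "ab") as fh:
        fh.write(record)
```

For example, append_record("page.warc.gz", build_warc_record("http://example.com/", b"<html></html>")) adds one record to the file; calling it again appends a second record.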
