
Optimize WARC generation in order to save space and time

I am trying to build a WARC file from a very large list of links spanning several domains, like this:

wget --no-check-certificate \
     --no-verbose \
     --execute robots=off \
     --delete-after \
     --no-directories \
     --page-requisites \
     --mirror \
     --no-warc-keep-log \
     --output-file=out/15M.log \
     --warc-cdx \
     --span-hosts \
     --domains=15hack.tomalaplaza.net,15m20.tomalaplaza.net,15oct.takethesquare.net,actasmadrid.tomalaplaza.net,alcala.tomalaplaza.net,alcorcon.tomalosbarrios.net,alcosanse.tomalosbarrios.net,alicante.tomalaplaza.net,aluche.tomalosbarrios.net,andorra.tomalaplaza.net,antibanks.takethesquare.net,aragon.tomalaplaza.net,aravaca.tomalosbarrios.net,arganzuela.tomalosbarrios.net,arroyomolinos.tomalosbarrios.net,asambleademostoles.tomalosbarrios.net,asambleaplayasalicante.tomalosbarrios.net,asamblea-sanlorenzo-escorial.tomalosbarrios.net,austrias.tomalosbarrios.net,aviles.tomalaplaza.net,barajas.tomalosbarrios.net,barcelona.tomalaplaza.net,barriodelpilar.tomalosbarrios.net,barriosdelsur.tomalosbarrios.net,batan.tomalosbarrios.net,becerril.tomalosbarrios.net,benicarlo.tomalaplaza.net,berlinbienal.tomalaplaza.net,bilbao.tomalaplaza.net,boadilladelmonte.tomalosbarrios.net,boalo.tomalosbarrios.net,burgos.tomalaplaza.net,caceres.tomalaplaza.net,cadiz.tomalaplaza.net,canadareal.tomalosbarrios.net,castellon.tomalaplaza.net,cercedilla.tomalosbarrios.net,chamartin.tomalosbarrios.net,chapineria.tomalosbarrios.net,chiclana.tomalaplaza.net,chueca.tomalosbarrios.net,ciempozuelos.tomalosbarrios.net,ciudadlineal.tomalosbarrios.net,colladomediano.tomalosbarrios.net,colladovillalba.tomalosbarrios.net,colmenarejo.tomalosbarrios.net,colmenarviejo.tomalosbarrios.net,compostela.tomalaplaza.net,comunicacionestatal15m.tomalaplaza.net,contralaviolenciadegenero.tomalaplaza.net,cordoba.tomalaplaza.net,coslada.tomalosbarrios.net,daganzodearriba.tomalosbarrios.net,debatedelpueblo.tomalosbarrios.net,debatepopular.tomalosbarrios.net,dec10.takethesquare.net,desmontandomentiras.tomalaplaza.net,donostia.tomalaplaza.net,dosdemayo.tomalosbarrios.net,economia.tomalaplaza.net,elalamo.tomalosbarrios.net,elche.tomalaplaza.net,elejido.tomalosbarrios.net,enbustarviejo.tomalosbarrios.net,encuentro15m.tomalaplaza.net,foro.tomalosbarrios.net,fuencarral.tomalosbarrios.net,fuenlabrada.tomalosbarrios.net,galapagar.tomalosbarrios.net,gamonal.tomalosbarrios.net,gasteiz.tomalaplaza.net,getafe.tomalosbarrios.net,granada.tomalaplaza.net,grancanaria.tomalosbarrios.net,guadalixdelasierra.tomalosbarrios.net,guadarrama.tomalosbarrios.net,guindalera.tomalosbarrios.net,hacksol.tomalaplaza.net,hortaleza.tomalosbarrios.net,howtocamp.takethesquare.net,hoyodemanzanares.tomalosbarrios.net,ibiza.tomalaplaza.net,jerez.tomalaplaza.net,jitsi.tomalaplaza.net,laconce.tomalosbarrios.net,laelipa.tomalosbarrios.net,lasmatas.tomalosbarrios.net,laspalmas.tomalaplaza.net,lasrozas.tomalosbarrios.net,lastablassanchinarro.tomalosbarrios.net,lavapies.tomalosbarrios.net,leganes.tomalosbarrios.net,leon.tomalaplaza.net,letras.tomalosbarrios.net,listas.tomalaplaza.net,listas.tomalosbarrios.net,lists.takethesquare.net,lleida.tomalaplaza.net,logrono.tomalaplaza.net,lucero.tomalosbarrios.net,madrid15m.org,madridocm.tomalaplaza.net,madridsur.tomalosbarrios.net,madrid.tomalaplaza.net,madrid.tomalosbarrios.net,majadahonda.tomalosbarrios.net,malaga.tomalaplaza.net,marchestobrussels.takethesquare.net,mataro.tomalosbarrios.net,mayo2013.tomalaplaza.net,mejoradadelcampo.tomalosbarrios.net,menorca.tomalaplaza.net,miraflores.tomalosbarrios.net,montecarmelo.tomalosbarrios.net,moralzarzal.tomalosbarrios.net,mumble.tomalaplaza.net,navalafuente.tomalosbarrios.net,nudomanoteras.tomalosbarrios.net,nuevobaztan.tomalosbarrios.net,ocmdaganzo.tomalaplaza.net,optt.tomalaplaza.net,ourense.tomalaplaza.net,oviedo.tomalaplaza.net,pads.tomalaplaza.net,pamplona.tomalaplaza.net,paracuellos.tomalosbarrios.net,parla.tomalosbarrios.net,parlaverde.tomalosbarrios.net,paseoextremadura.tomalosbarrios.net,pedrezuela.tomalosbarrios.net,pedriza.tomalosbarrios.net,piedragrande.tomalosbarrios.net,pinto.tomalosbarrios.net,plazadali.tomalosbarrios.net,pontevedra.tomalaplaza.net,pozuelo.tomalosbarrios.net,prosperidad.tomalosbarrios.net,pueblonuevo.tomalosbarrios.net,pve.tomalaplaza.net,radio.takethesquare.net,retiro.tomalosbarrios.net,rivas.tomalosbarrios.net,ronda.tomalaplaza.net,salamanca.tomalaplaza.net,sanblas.tomalosbarrios.net,sanfernandodehenares.tomalosbarrios.net,sanmartindelavega.tomalosbarrios.net,santiago.tomalaplaza.net,segovia.tomalaplaza.net,sesena.tomalosbarrios.net,sevilla.tomalaplaza.net,sevilla.tomalosbarrios.net,sierranorte.tomalosbarrios.net,smvaldeiglesias.tomalosbarrios.net,soria.tomalaplaza.net,soto.tomalosbarrios.net,stamariadelaalameda.tomalosbarrios.net,stats.tomalaplaza.net,takethesquare.net,talavera.tomalaplaza.net,tcj.tomalaplaza.net,teruel.tomalaplaza.net,tetuan.tomalosbarrios.net,toledo.tomalaplaza.net,tomalaplaza.net,tomalosbarrios.net,torrejon.tomalosbarrios.net,torrelaguna.tomalosbarrios.net,torrelodones.tomalosbarrios.net,torresalameda.tomalosbarrios.net,transitionday.takethesquare.net,trescantos.tomalosbarrios.net,usera.tomalosbarrios.net,valdemorilloynavalagamella.tomalosbarrios.net,valdemoro.tomalosbarrios.net,valencia.tomalaplaza.net,vdelacanada.tomalosbarrios.net,vegadeltajuna.tomalaplaza.net,velilla.tomalosbarrios.net,vemail.tomalaplaza.net,vicalvaro.tomalosbarrios.net,vigo.tomalaplaza.net,villadevallecas.tomalosbarrios.net,villaverde.tomalosbarrios.net,wiki.tomalaplaza.net,www.tomalatele.tv,zamora.tomalaplaza.net,zaragoza.tomalaplaza.net,zaragoza.tomalosbarrios.net,zarzalejo.tomalosbarrios.net \
     --warc-file=out/15M \
     https://15hack.github.io/web-backup/out/links.html

I am doing this in a single command because I thought that generating one single WARC would compress better than creating a separate WARC for each domain.

Another reason to keep everything in one single WARC is to be able to follow links from one site to another.

But this job took 18 days and produced a 19 GB WARC file. I am also having problems opening this WARC in some applications; I think it is because of the file size.

I also just read in https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem that WARC files should top out at 1 GB.

So my questions are:

  • What would be the best way to create a WARC for all the links listed in https://15hack.github.io/web-backup/out/links.html ?
  • Should I create several WARCs?
  • If I create several WARCs (for example, one per domain), how can I follow links from one site to another using the WARCs?
  • Is there any wget parameter that I could use to improve performance and compression?

Thanks

> But this job took 18 days

If this is a problem for you, consider preparing one command per domain, each writing its own WARC file, and running those commands in parallel. Note that this may or may not help: it should help only if you still have spare connection capacity (i.e. the servers do not deliver data fast enough to use all or almost all of your connection capacity).
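A minimal sketch of that idea follows. Several things here are assumptions rather than part of the original command: the domains arrive one per line on standard input, each crawl starts at the domain's root URL instead of the shared links.html page, output goes to `out/`, and the helper name `crawl_domains_in_parallel` is made up for illustration. It relies on `xargs -P`, which GNU and BSD xargs provide but POSIX does not require.

```shell
#!/bin/sh
# Hypothetical helper: crawl every domain read from stdin (one per line),
# running up to $1 wget processes at once, each writing its own WARC and CDX.
# Export WGET=echo to print the commands instead of running them (dry run).
crawl_domains_in_parallel() {
    max_jobs="${1:-4}"          # 4 is only a starting guess; tune to your bandwidth
    mkdir -p out
    xargs -P "$max_jobs" -n 1 sh -c '
        domain="$1"
        ${WGET:-wget} --no-check-certificate --no-verbose \
            --execute robots=off \
            --delete-after --no-directories --page-requisites --mirror \
            --no-warc-keep-log --warc-cdx \
            --domains="$domain" \
            --output-file="out/$domain.log" \
            --warc-file="out/$domain" \
            "http://$domain/"
    ' _
}

# Usage (assuming domains.txt holds one domain per line):
#   crawl_domains_in_parallel 4 < domains.txt
```

Because `--span-hosts` is left out, each process stays on its own host, so page requisites served from other hosts would not be captured by this sketch.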

> I also just read in https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem that WARC files should top out at 1 GB.

If you must comply with such a requirement, you can use the following wget option:

   --warc-max-size=size
       Set the maximum size of the WARC files to size.
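As a rough illustration, the option could be added to a crawl like this. The `1G` value, the reduced option set, and the function name `split_crawl` are assumptions chosen for the example; with a size limit set, wget numbers the output files using the `--warc-file` prefix (for example `out/15M-00000.warc.gz`, `out/15M-00001.warc.gz`).

```shell
#!/bin/sh
# Hypothetical wrapper: crawl URL $2 into WARC files of at most about 1 GB each,
# named after the prefix $1. Set WGET=echo to print the command (dry run).
split_crawl() {
    ${WGET:-wget} --mirror --page-requisites \
        --warc-max-size=1G \
        --warc-cdx \
        --warc-file="$1" \
        "$2"
}

# Usage:
#   split_crawl out/15M https://15hack.github.io/web-backup/out/links.html
```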

> Is there any wget parameter that I could use to improve performance and compression?

I suggest reading "Options in Wget with WARC output". I suspect --no-warc-keep-log might give a slightly smaller file size. You might also experiment with --warc-tempdir=DIRECTORY if you can use a directory located on a disk with faster read/write speed.
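One way to try the --warc-tempdir suggestion is sketched below. It assumes a Linux-style RAM-backed `/dev/shm` may be available and falls back to `$TMPDIR` or `/tmp` otherwise; the function name `pick_warc_tempdir` is made up for the example, and whether this speeds anything up depends on where the crawl is actually bottlenecked.

```shell
#!/bin/sh
# Hypothetical helper: create and print a scratch directory for wget's WARC
# temporary files, preferring /dev/shm (RAM-backed on many Linux systems).
pick_warc_tempdir() {
    for base in /dev/shm "${TMPDIR:-/tmp}"; do
        if [ -d "$base" ] && [ -w "$base" ]; then
            mktemp -d "$base/warc-tmp.XXXXXX"
            return
        fi
    done
    return 1    # no writable candidate directory found
}

# Usage:
#   tmp=$(pick_warc_tempdir) && wget --warc-tempdir="$tmp" ...other options...
#   rm -rf "$tmp"
```

Keep in mind that a RAM-backed directory competes with the crawl for memory, so large responses may argue for a fast local disk instead.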

> If I create several WARCs (for example, one per domain), how can I follow links from one site to another using the WARCs?

WARC has a companion file format called CDX, which is used for indexing; in plain words, it mainly records which WARC file holds the data for a given URL. Each line of a CDX file describes one record from a WARC file; the fields are space-separated, and one of them is the URL. Thus you should be able to find the line for the URL you are interested in, using for example grep, and then read off which WARC file it is stored in.
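The lookup can be sketched with awk instead of a plain grep, so that the URL is matched exactly rather than as a substring. This assumes each CDX file begins with a legend line such as ` CDX a b a m s k r M V g u`, where `a` marks the original URL, `g` the WARC file name and `V` the compressed record offset; the script reads the legend instead of hard-coding column positions. The function name `cdx_lookup` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "WARC-file offset" for every CDX record whose
# URL equals $1. The remaining arguments are the CDX files to search.
cdx_lookup() {
    url="$1"
    shift
    awk -v url="$url" '
        # Legend line: map field letters to data columns. The legend carries
        # one extra leading token ("CDX"), hence the i - 1.
        $1 == "CDX" {
            url_col = file_col = off_col = 0
            for (i = 2; i <= NF; i++) {
                if ($i == "a" && !url_col) url_col = i - 1
                if ($i == "g") file_col = i - 1
                if ($i == "V") off_col = i - 1
            }
            next
        }
        # Data line: report file name and offset when the URL matches.
        url_col && file_col && $url_col == url {
            print $file_col, (off_col ? $off_col : "-")
        }
    ' "$@"
}

# Usage (assuming one CDX per domain was written to out/):
#   cdx_lookup http://madrid.tomalaplaza.net/ out/*.cdx
```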
