
Optimize WARC generation in order to save space and time

I am trying to create a WARC file from a very large list of links spanning several domains, like this:

wget --no-check-certificate \
     --no-verbose \
     --execute robots=off \
     --delete-after \
     --no-directories \
     --page-requisites \
     --mirror \
     --no-warc-keep-log \
     --output-file=out/15M.log \
     --warc-cdx \
     --span-hosts \
     --domains=15hack.tomalaplaza.net,15m20.tomalaplaza.net,15oct.takethesquare.net,actasmadrid.tomalaplaza.net,alcala.tomalaplaza.net,alcorcon.tomalosbarrios.net,alcosanse.tomalosbarrios.net,alicante.tomalaplaza.net,aluche.tomalosbarrios.net,andorra.tomalaplaza.net,antibanks.takethesquare.net,aragon.tomalaplaza.net,aravaca.tomalosbarrios.net,arganzuela.tomalosbarrios.net,arroyomolinos.tomalosbarrios.net,asambleademostoles.tomalosbarrios.net,asambleaplayasalicante.tomalosbarrios.net,asamblea-sanlorenzo-escorial.tomalosbarrios.net,austrias.tomalosbarrios.net,aviles.tomalaplaza.net,barajas.tomalosbarrios.net,barcelona.tomalaplaza.net,barriodelpilar.tomalosbarrios.net,barriosdelsur.tomalosbarrios.net,batan.tomalosbarrios.net,becerril.tomalosbarrios.net,benicarlo.tomalaplaza.net,berlinbienal.tomalaplaza.net,bilbao.tomalaplaza.net,boadilladelmonte.tomalosbarrios.net,boalo.tomalosbarrios.net,burgos.tomalaplaza.net,caceres.tomalaplaza.net,cadiz.tomalaplaza.net,canadareal.tomalosbarrios.net,castellon.tomalaplaza.net,cercedilla.tomalosbarrios.net,chamartin.tomalosbarrios.net,chapineria.tomalosbarrios.net,chiclana.tomalaplaza.net,chueca.tomalosbarrios.net,ciempozuelos.tomalosbarrios.net,ciudadlineal.tomalosbarrios.net,colladomediano.tomalosbarrios.net,colladovillalba.tomalosbarrios.net,colmenarejo.tomalosbarrios.net,colmenarviejo.tomalosbarrios.net,compostela.tomalaplaza.net,comunicacionestatal15m.tomalaplaza.net,contralaviolenciadegenero.tomalaplaza.net,cordoba.tomalaplaza.net,coslada.tomalosbarrios.net,daganzodearriba.tomalosbarrios.net,debatedelpueblo.tomalosbarrios.net,debatepopular.tomalosbarrios.net,dec10.takethesquare.net,desmontandomentiras.tomalaplaza.net,donostia.tomalaplaza.net,dosdemayo.tomalosbarrios.net,economia.tomalaplaza.net,elalamo.tomalosbarrios.net,elche.tomalaplaza.net,elejido.tomalosbarrios.net,enbustarviejo.tomalosbarrios.net,encuentro15m.tomalaplaza.net,foro.tomalosbarrios.net,fuencarral.tomalosbarrios.net,fuenlabrada.tomalosbarrios.net,galapagar.tomalosbarrios.net,gamonal.tomalosbarrios.net,gasteiz.tomalaplaza.net,getafe.tomalosbarrios.net,granada.tomalaplaza.net,grancanaria.tomalosbarrios.net,guadalixdelasierra.tomalosbarrios.net,guadarrama.tomalosbarrios.net,guindalera.tomalosbarrios.net,hacksol.tomalaplaza.net,hortaleza.tomalosbarrios.net,howtocamp.takethesquare.net,hoyodemanzanares.tomalosbarrios.net,ibiza.tomalaplaza.net,jerez.tomalaplaza.net,jitsi.tomalaplaza.net,laconce.tomalosbarrios.net,laelipa.tomalosbarrios.net,lasmatas.tomalosbarrios.net,laspalmas.tomalaplaza.net,lasrozas.tomalosbarrios.net,lastablassanchinarro.tomalosbarrios.net,lavapies.tomalosbarrios.net,leganes.tomalosbarrios.net,leon.tomalaplaza.net,letras.tomalosbarrios.net,listas.tomalaplaza.net,listas.tomalosbarrios.net,lists.takethesquare.net,lleida.tomalaplaza.net,logrono.tomalaplaza.net,lucero.tomalosbarrios.net,madrid15m.org,madridocm.tomalaplaza.net,madridsur.tomalosbarrios.net,madrid.tomalaplaza.net,madrid.tomalosbarrios.net,majadahonda.tomalosbarrios.net,malaga.tomalaplaza.net,marchestobrussels.takethesquare.net,mataro.tomalosbarrios.net,mayo2013.tomalaplaza.net,mejoradadelcampo.tomalosbarrios.net,menorca.tomalaplaza.net,miraflores.tomalosbarrios.net,montecarmelo.tomalosbarrios.net,moralzarzal.tomalosbarrios.net,mumble.tomalaplaza.net,navalafuente.tomalosbarrios.net,nudomanoteras.tomalosbarrios.net,nuevobaztan.tomalosbarrios.net,ocmdaganzo.tomalaplaza.net,optt.tomalaplaza.net,ourense.tomalaplaza.net,oviedo.tomalaplaza.net,pads.tomalaplaza.net,pamplona.tomalaplaza.net,paracuellos.tomalosbarrios.net,parla.tomalosbarrios.net,parlaverde.tomalosbarrios.net,paseoextremadura.tomalosbarrios.net,pedrezuela.tomalosbarrios.net,pedriza.tomalosbarrios.net,piedragrande.tomalosbarrios.net,pinto.tomalosbarrios.net,plazadali.tomalosbarrios.net,pontevedra.tomalaplaza.net,pozuelo.tomalosbarrios.net,prosperidad.tomalosbarrios.net,pueblonuevo.tomalosbarrios.net,pve.tomalaplaza.net,radio.takethesquare.net,retiro.tomalosbarrios.net,rivas.tomalosbarrios.net,ronda.tomalaplaza.net,salamanca.tomalaplaza.net,sanblas.tomalosbarrios.net,sanfernandodehenares.tomalosbarrios.net,sanmartindelavega.tomalosbarrios.net,santiago.tomalaplaza.net,segovia.tomalaplaza.net,sesena.tomalosbarrios.net,sevilla.tomalaplaza.net,sevilla.tomalosbarrios.net,sierranorte.tomalosbarrios.net,smvaldeiglesias.tomalosbarrios.net,soria.tomalaplaza.net,soto.tomalosbarrios.net,stamariadelaalameda.tomalosbarrios.net,stats.tomalaplaza.net,takethesquare.net,talavera.tomalaplaza.net,tcj.tomalaplaza.net,teruel.tomalaplaza.net,tetuan.tomalosbarrios.net,toledo.tomalaplaza.net,tomalaplaza.net,tomalosbarrios.net,torrejon.tomalosbarrios.net,torrelaguna.tomalosbarrios.net,torrelodones.tomalosbarrios.net,torresalameda.tomalosbarrios.net,transitionday.takethesquare.net,trescantos.tomalosbarrios.net,usera.tomalosbarrios.net,valdemorilloynavalagamella.tomalosbarrios.net,valdemoro.tomalosbarrios.net,valencia.tomalaplaza.net,vdelacanada.tomalosbarrios.net,vegadeltajuna.tomalaplaza.net,velilla.tomalosbarrios.net,vemail.tomalaplaza.net,vicalvaro.tomalosbarrios.net,vigo.tomalaplaza.net,villadevallecas.tomalosbarrios.net,villaverde.tomalosbarrios.net,wiki.tomalaplaza.net,www.tomalatele.tv,zamora.tomalaplaza.net,zaragoza.tomalaplaza.net,zaragoza.tomalosbarrios.net,zarzalejo.tomalosbarrios.net \
     --warc-file=out/15M \
     https://15hack.github.io/web-backup/out/links.html

I am doing this in a single command because I thought that a single WARC would compress better than a separate WARC for each domain.

Another reason to keep everything in a single WARC is to be able to follow links from one site to another.

But this job took 18 days and generated a 19 GB WARC file. I am also having problems opening this WARC in some applications, which I think is because of the file size.

I also just read in https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem that WARC files should top out at 1 GB.

So my questions are:

  • What would be the best way to create a WARC for all the links listed in https://15hack.github.io/web-backup/out/links.html ?
  • Should I create several WARCs?
  • If I create several WARCs (for example, one per domain), how can I follow links from one site to another across the WARCs?
  • Is there any wget parameter that I could use to improve performance and compression?

Thanks

"But this job took 18 days"

If this is a problem for you, consider preparing one command per domain, each writing its own file, and running them in parallel. Note that this may or may not help: it should help only if you still have spare connection capacity (i.e. the servers are not already delivering enough data to use all or almost all of it).
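
A minimal sketch of the per-domain approach, under some assumptions that are not in the question: domains.txt is a hypothetical file with one domain per line, the default of 4 parallel jobs is an arbitrary starting point, and your xargs supports -P (GNU and BSD do). Each job starts from the same links page as the original command but is restricted to a single domain.

```shell
#!/bin/sh
# Sketch only: one wget process (and one WARC) per domain, a few at a time.
# domains.txt and the job count are assumptions, not part of the question.

WGET=${WGET:-wget}   # set WGET=echo to print the commands instead (dry run)
export WGET

# crawl_all JOBS LIST: read domains from LIST and run up to JOBS crawls at once.
crawl_all() {
    jobs=${1:-4}
    list=${2:-domains.txt}
    mkdir -p out
    # xargs passes one domain per invocation; it becomes $1 in the inner shell.
    xargs -P "$jobs" -n 1 sh -c '
        "$WGET" --no-check-certificate --no-verbose --execute robots=off \
             --delete-after --no-directories --page-requisites --mirror \
             --no-warc-keep-log --warc-cdx --span-hosts \
             --domains="$1" --output-file="out/$1.log" --warc-file="out/$1" \
             https://15hack.github.io/web-backup/out/links.html
    ' _ < "$list"
}
```

Running crawl_all 4 domains.txt after a dry run with WGET=echo lets you check the generated commands first. Raising the job count only pays off while the connection is not yet saturated.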

"I also just read in https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem that WARC files should top out at 1 GB."

If you must comply with such a requirement, you can use the following wget option:

   --warc-max-size=size
       Set the maximum size of the WARC files to size.
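
As a rough illustration, the original command with a size cap added might look like the sketch below. The DOMAINS variable is an assumption: it stands for the comma-separated list from the question. The 1G default follows the 1 GB guideline quoted above.

```shell
#!/bin/sh
# Sketch: the crawl from the question with a cap on the size of each WARC.
# DOMAINS is assumed to hold the comma-separated domain list from the question.

WGET=${WGET:-wget}   # set WGET=echo to print the command instead (dry run)

# crawl_capped [SIZE]: run the crawl, starting a new WARC once SIZE is reached.
crawl_capped() {
    size=${1:-1G}
    "$WGET" --no-check-certificate --no-verbose --execute robots=off \
         --delete-after --no-directories --page-requisites --mirror \
         --no-warc-keep-log --output-file=out/15M.log --warc-cdx \
         --span-hosts --domains="$DOMAINS" \
         --warc-max-size="$size" --warc-file=out/15M \
         https://15hack.github.io/web-backup/out/links.html
}
```

As far as I know, wget then writes numbered files (out/15M-00000.warc.gz, out/15M-00001.warc.gz and so on) instead of a single out/15M.warc.gz, but it is worth confirming this on a small test crawl.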

"Is there any wget parameter that I could use to improve performance and compression?"

I suggest reading about Options in Wget with WARC output. I suspect --no-warc-keep-log gives a slightly smaller file. You might also experiment with --warc-tempdir=DIRECTORY if you are able to use a directory on a disk with faster read/write speed.
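
One way to try the --warc-tempdir idea is sketched below. The helper name fast_tempdir_opt and the default base /dev/shm (a RAM-backed filesystem on many Linux systems) are assumptions; use whatever fast directory you actually have.

```shell
#!/bin/sh
# Sketch: create a scratch directory on fast storage and print the matching
# wget option. /dev/shm is only an assumed default; pass your own base.

# fast_tempdir_opt [BASE]: make a unique directory under BASE and print
# "--warc-tempdir=<that directory>" on standard output.
fast_tempdir_opt() {
    base=${1:-/dev/shm}
    dir=$(mktemp -d "$base/wget-warc.XXXXXX") || return 1
    printf '%s\n' "--warc-tempdir=$dir"
}
```

It could then be used as wget "$(fast_tempdir_opt /dev/shm)" --warc-file=out/15M ... with the rest of the original options. Whether it makes a measurable difference depends on how much time the crawl spends on disk rather than waiting for the network.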

"If I create several WARCs (for example, one per domain), how can I follow links from one site to another across the WARCs?"

WARC has a companion file format called CDX. It is used for indexing or, in plain words, it mainly records which WARC file holds the data for a given URL. Each line of a CDX file describes one record from a WARC file; the fields are space-separated, and one of them is the URL. So you should be able to find the line for the URL you are interested in, for example with grep, and then read from it which WARC file the data is stored in.
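
The lookup described above can be sketched with awk instead of grep, so that the URL is matched exactly rather than as a substring. This assumes the 11-field layout that wget announces in its CDX header line (" CDX a b a m s k r M V g u"), where field 1 is the URL, field 9 the offset and field 10 the WARC file name. Check the header of your own CDX files before relying on the field numbers. The name cdx_find is made up for this example.

```shell
#!/bin/sh
# Sketch: find which WARC file holds a given URL by searching CDX indexes.
# Field positions are assumed from the header " CDX a b a m s k r M V g u".

# cdx_find URL CDXFILE...: print "warc-file offset" for each record of URL.
cdx_find() {
    url=$1
    shift
    awk -v url="$url" '
        $1 == "CDX" { next }                 # skip the header line
        $1 == url   { print $10, $9 }        # file name, then offset
    ' "$@"
}
```

For example, cdx_find http://madrid.tomalaplaza.net/ out/*.cdx would list every WARC that contains a record for that URL, assuming the CDX files were written to out/ with --warc-cdx.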
