
wpull creating multiple unique captures per WARC file

I'm using wpull to grab customer sites and save them as WARC files.

The issue I'm having is that, for some reason, it's creating multiple captures of sites. Sometimes it only does one, but other times it creates anywhere from 2 to 6 to 15 captures of the same site. I don't think the capture code is really the issue ...

// $argv[1] would normally come from the command line; hard-coded here for the example.
$argv[1] = 'example.com';

// Build the wpull command (the duplicated --no-check-certificate / --no-robots flags removed).
$command = 'wpull '.$argv[1].' --force-directories --warc-file '.$argv[1].' --no-check-certificate --no-robots --output-file '.$argv[1].'.log --user-agent "Mozilla 2.2" --wait 0.5 --random-wait --waitretry 600 --page-requisites --recursive --span-hosts-allow linked-pages,page-requisites --escaped-fragment --strip-session-id --sitemaps --reject-regex "/login\.php" --tries 3 --retry-connrefused --retry-dns-error --timeout 60 --delete-after -D '.$argv[1].' --max-redirect 10 --warc-cdx';

$response = shell_exec($command);

but I can't figure out either (a) what makes it pull multiple captures, or (b) how to force it to capture once.

I've tried including a database file to resume from, in case it was a memory issue, but that didn't make any difference, other than preventing me from doing multiple pulls in a row.
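For reference, the resume attempt looked roughly like this (a minimal sketch assuming wpull's --database option, which stores crawl state in a SQLite file; the .db filename is just an illustration):

// Sketch only: persist crawl state so an interrupted run can be resumed.
// The .db filename is arbitrary; --database writes wpull's state to a SQLite file.
$command .= ' --database '.$argv[1].'.db';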

My test pool consists of 115 URLs, so I can rule out the possibility that the problem is specific to one of the websites I'm pulling.

Options for wpull can be found here: https://wpull.readthedocs.io/en/master/options.html

and the documentation for pywb (used to display the contents) is here: https://github.com/ikreymer/pywb

I'm 90% sure this has to do with wpull, but since I'm a WARC newbie I'm not ruling out that it could be something to do with adding the *.warc.gz file to the archive.
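For context, the WARC gets added to pywb roughly like this (a sketch assuming pywb's wb-manager CLI; 'my-collection' is just a placeholder collection name):

// Sketch: register the finished WARC with a pywb collection for playback.
// 'my-collection' is a placeholder; wb-manager ships with pywb.
shell_exec('wb-manager init my-collection');
shell_exec('wb-manager add my-collection '.$argv[1].'.warc.gz');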

OK, there's a weird nuance in wpull with --recursive. If set, it will follow every http(s):// link it finds and do a full pull of each. Adding -D site.com limits those pulls to the specified domain(s).

However, this creates a scenario where it follows every http(s) link that points back to the same domain, from that domain, and captures each one... generating multiple captures of the same domain.

The --recursive flag isn't required to pull down a single URL in full. It's only needed if you want to capture everything the website links to as well (see the sketch below).
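As a minimal sketch (not the exact command I ended up with), dropping --recursive while keeping --page-requisites yields a single capture of the target page plus the assets needed to render it:

// Sketch only: one capture of the page (plus its requisites) into one WARC,
// without following links recursively. example.com is a placeholder.
$site = 'example.com';
$command = 'wpull '.$site
         .' --warc-file '.$site
         .' --page-requisites'                 // CSS/JS/images needed to render the page
         .' --no-robots --no-check-certificate'
         .' --delete-after';                   // keep only the WARC, not the mirrored files
$response = shell_exec($command);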
