I am trying to download files listed in a file (test.txt) that contains over 15,000 links.
I have this script:
#!/bin/bash

function download {
    FILE=$1
    while read line; do
        url=$line
        wget -nc -P ./images/ $url
        # download images that are not listed in test.txt,
        # by guessing names: 12345_001.jpg, 12345_002.jpg .. 12345_005.jpg etc.
        wget -nc -P ./images/ ${url%.jpg}_{001..005}.jpg
    done < $FILE
}

# test.txt contains the URLs
split -l 1000 ./temp/test.txt ./temp/split

# read the split files and pass each one to the download function
for f in ./temp/split*; do
    download $f &
done
test.txt:
http://xy.com/12345.jpg
http://xy.com/33442.jpg
...
I am splitting the file into a few pieces and backgrounding the download calls ( download $f & ) so the script can move on to the next split file of links. The script works, but it does not exit cleanly at the end; I have to press Enter to get my prompt back. If I remove the & from download $f & it works, but I lose the parallel downloading.
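For reference, one way to keep this approach but have the script stay in the foreground until every chunk has finished is a final wait (a bash builtin that blocks until all background jobs are done). A minimal sketch of the launching loop, using the same paths as above:

for f in ./temp/split*; do
    download "$f" &
done
wait   # do not return until every backgrounded download has finished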
Edit:
As I have since found out, this is not the best way to parallelize wget downloads; GNU Parallel would be a better fit.
The script is exiting, but the wget processes in the background are still producing output after the script exits, and this gets printed after the shell prompt. That is why you need to press Enter to get another prompt.

Use the -q option of wget to turn off the output.
while read line; do
    url=$line
    wget -ncq -P ./images/ "$url"
    wget -ncq -P ./images/ "${url%.jpg}"_{001..005}.jpg
done < "$FILE"
May I commend GNU Parallel to you?
parallel --dry-run -j32 -a URLs.txt 'wget -ncq -P ./images/ {}; wget -ncq -P ./images/ {.}_{001..005}.jpg'
I am only guessing what your input file URLs.txt looks like, but I assume something resembling:
http://somesite.com/image1.jpg
http://someothersite.com/someotherimage.jpg
Or, using your own approach with a function:
#!/bin/bash

# define and export a function for "parallel" to call
doit(){
    wget -ncq -P ./images/ "$1"
    wget -ncq -P ./images/ "$2"_{001..005}.jpg
}
export -f doit

parallel --dry-run -j32 -a URLs.txt doit {} {.}
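Note that --dry-run only prints the commands GNU Parallel would run, which is useful for checking the generated URLs first; once the output looks right, run the same command without it (the -j32 job count is just the value used above, not a requirement):

parallel -j32 -a URLs.txt doit {} {.}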
@Barmar's answer is correct. However, I would like to present a different, more efficient solution: you could look into using Wget2.

Wget2 is the next major version of GNU Wget. It comes with many new features, including multi-threaded downloading. So, with GNU Wget2, all you need to do is pass the --max-threads option and select the number of parallel threads you want to spawn.

You can compile it from the git repository very easily. There are also packages for Arch Linux in the AUR and in Debian.
EDIT: Full Disclosure: I am one of the maintainers of GNU Wget and GNU Wget2.
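A minimal sketch of what that could look like, assuming wget2 accepts the same -i, -nc and -P flags as classic wget (the thread count here is arbitrary):

wget2 --max-threads=16 -nc -P ./images/ -i ./temp/test.txt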
For reference, the relevant options from wget --help:

Logging and input file:
-i, --input-file=FILE download URLs found in local or external FILE.
-o, --output-file=FILE log messages to FILE.
-a, --append-output=FILE append messages to FILE.
-d, --debug print lots of debugging information.
-q, --quiet quiet (no output).
-v, --verbose be verbose (this is the default).
-nv, --no-verbose turn off verboseness, without being quiet.
--report-speed=TYPE Output bandwidth as TYPE. TYPE can be bits.
-F, --force-html treat input file as HTML.
-B, --base=URL resolves HTML input-file links (-i -F)
relative to URL.
--config=FILE Specify config file to use.
Download:
-nc, --no-clobber skip downloads that would download to existing files (overwriting them).
-t, --tries=NUMBER set number of retries to NUMBER (0 unlimits).
--retry-connrefused retry even if connection is refused.
-O, --output-document=FILE write documents to FILE.
-c, --continue resume getting a partially-downloaded file.
--progress=TYPE select progress gauge type.
-N, --timestamping don't re-retrieve files unless newer than
local.
--no-use-server-timestamps don't set the local file's timestamp by
the one on the server.
-S, --server-response print server response.
--spider don't download anything.
-T, --timeout=SECONDS set all timeout values to SECONDS.
--dns-timeout=SECS set the DNS lookup timeout to SECS.
--connect-timeout=SECS set the connect timeout to SECS.
--read-timeout=SECS set the read timeout to SECS.
-w, --wait=SECONDS wait SECONDS between retrievals.
--waitretry=SECONDS wait 1..SECONDS between retries of a retrieval.
--random-wait wait from 0.5*WAIT...1.5*WAIT secs between retrievals.
--no-proxy explicitly turn off proxy.
-Q, --quota=NUMBER set retrieval quota to NUMBER.
--bind-address=ADDRESS bind to ADDRESS (hostname or IP) on local host.
--limit-rate=RATE limit download rate to RATE.
--no-dns-cache disable caching DNS lookups.
--restrict-file-names=OS restrict chars in file names to ones OS allows.
--ignore-case ignore case when matching files/directories.
-4, --inet4-only connect only to IPv4 addresses.
-6, --inet6-only connect only to IPv6 addresses.
--prefer-family=FAMILY connect first to addresses of specified family,
one of IPv6, IPv4, or none.
--user=USER set both ftp and http user to USER.
--password=PASS set both ftp and http password to PASS.
--ask-password prompt for passwords.
--no-iri turn off IRI support.
--local-encoding=ENC use ENC as the local encoding for IRIs.
--remote-encoding=ENC use ENC as the default remote encoding.
--unlink remove file before clobber.
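Taken together, those two options let wget read the URL list itself and skip files that are already present, so a single call can replace the read loop from the question (this does not attempt the _001.._005 name guessing). A minimal sketch:

wget -nc -q -P ./images/ -i ./temp/test.txt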