
Parallel wget download script does not exit properly

I am trying to download files listed in a file (test.txt) that contains over 15,000 links.

I have this script:

#!/bin/bash

function download {

FILE=$1

while read line; do
        url=$line

        wget -nc -P ./images/ $url

        #downloading images which are not in the test.txt, 
        #by guessing name: 12345_001.jpg, 12345_002.jpg..12345_005.jpg etc.

        wget -nc  -P ./images/ ${url%.jpg}_{001..005}.jpg
done < $FILE

}  

#test.txt contains the URLs
split -l 1000 ./temp/test.txt ./temp/split

#read the split files and pass each one to the download function
for f in ./temp/split*; do
    download $f &
done

test.txt:

http://xy.com/12345.jpg
http://xy.com/33442.jpg
...

I am splitting the file into several pieces and backgrounding each call ( download $f & ) so the script can move on to the next split file of links while the wget processes run.

The script works, but it does not seem to exit at the end; I have to press Enter to get my prompt back. If I remove the & from the line download $f & it exits cleanly, but then I lose the parallel downloading.

Edit:

As I have since found out, this is not the best way to parallelize wget downloads; it would be better to use GNU Parallel.


The script is exiting, but the wget processes in the background are producing output after the script exits, and this gets printed after the shell prompt. So you need to press Enter to get another prompt.

Use the -q option to wget to turn off output.

while read line; do
        url=$line
        wget -ncq -P ./images/ "$url"
        wget -ncq  -P ./images/ "${url%.jpg}"_{001..005}.jpg
done < "$FILE"
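If you also want the script itself to keep running until every download is done (so nothing prints after your prompt at all), you could additionally wait on the background jobs. This is not part of the answer above, just a minimal sketch of the calling loop, assuming the same download function as in the question:

for f in ./temp/split*; do
    download "$f" &
done
# wait blocks until all backgrounded "download" jobs have finished
wait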

May I commend GNU Parallel to you?

parallel --dry-run -j32 -a URLs.txt 'wget -ncq -P ./images/ {}; wget -ncq  -P ./images/ {.}_{001..005}.jpg'

I am only guessing what your input file URLs.txt looks like; presumably something resembling:

http://somesite.com/image1.jpg
http://someothersite.com/someotherimage.jpg

Or, using your own approach with a function:

#!/bin/bash

# define and export a function for "parallel" to call
doit(){
   # $1 is the full URL, $2 is the URL with its extension stripped ({.} from parallel)
   wget -ncq -P ./images/ "$1"
   wget -ncq -P ./images/ "$2"_{001..005}.jpg
}
export -f doit

parallel --dry-run  -j32 -a URLs.txt doit {} {.}

@Barmar's answer is correct. However, I would like to present a different, more efficient solution: you could look into using Wget2.

Wget2 is the next major version of GNU Wget. It comes with many new features, including multi-threaded downloading. So, with GNU Wget2, all you would need to do is pass the --max-threads option and select the number of parallel threads you want to spawn.
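For example, a minimal sketch assuming your URLs are in test.txt and that Wget2 accepts the same -i, -nc, -q and -P options as Wget (it aims to be option-compatible):

wget2 --max-threads=16 -nc -q -P ./images/ -i test.txt

Note this only fetches the URLs listed in the file; the guessed _001.._005 variants would have to be appended to the list beforehand.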

You can compile it from the git repository very easily. There are also packages for Arch Linux in the AUR and in Debian.

EDIT: Full Disclosure: I am one of the maintainers of GNU Wget and GNU Wget2.

  1. Please read the wget manual page / help output.

Logging and input file:

-i, --input-file=FILE download URLs found in local or external FILE.

  -o,  --output-file=FILE    log messages to FILE.
  -a,  --append-output=FILE  append messages to FILE.
  -d,  --debug               print lots of debugging information.
  -q,  --quiet               quiet (no output).
  -v,  --verbose             be verbose (this is the default).
  -nv, --no-verbose          turn off verboseness, without being quiet.
       --report-speed=TYPE   Output bandwidth as TYPE.  TYPE can be bits.
  -i,  --input-file=FILE     download URLs found in local or external FILE.
  -F,  --force-html          treat input file as HTML.
  -B,  --base=URL            resolves HTML input-file links (-i -F)
                             relative to URL.
       --config=FILE         Specify config file to use.

Download:

-nc, --no-clobber skip downloads that would download to existing files (overwriting them).

  -t,  --tries=NUMBER            set number of retries to NUMBER (0 unlimits).
       --retry-connrefused       retry even if connection is refused.
  -O,  --output-document=FILE    write documents to FILE.
  -nc, --no-clobber              skip downloads that would download to
                                 existing files (overwriting them).
  -c,  --continue                resume getting a partially-downloaded file.
       --progress=TYPE           select progress gauge type.
  -N,  --timestamping            don't re-retrieve files unless newer than
                                 local.
  --no-use-server-timestamps     don't set the local file's timestamp by
                                 the one on the server.
  -S,  --server-response         print server response.
       --spider                  don't download anything.
  -T,  --timeout=SECONDS         set all timeout values to SECONDS.
       --dns-timeout=SECS        set the DNS lookup timeout to SECS.
       --connect-timeout=SECS    set the connect timeout to SECS.
       --read-timeout=SECS       set the read timeout to SECS.
  -w,  --wait=SECONDS            wait SECONDS between retrievals.
       --waitretry=SECONDS       wait 1..SECONDS between retries of a retrieval.
       --random-wait             wait from 0.5*WAIT...1.5*WAIT secs between retrievals.
       --no-proxy                explicitly turn off proxy.
  -Q,  --quota=NUMBER            set retrieval quota to NUMBER.
       --bind-address=ADDRESS    bind to ADDRESS (hostname or IP) on local host.
       --limit-rate=RATE         limit download rate to RATE.
       --no-dns-cache            disable caching DNS lookups.
       --restrict-file-names=OS  restrict chars in file names to ones OS allows.
       --ignore-case             ignore case when matching files/directories.
  -4,  --inet4-only              connect only to IPv4 addresses.
  -6,  --inet6-only              connect only to IPv6 addresses.
       --prefer-family=FAMILY    connect first to addresses of specified family,
                                 one of IPv6, IPv4, or none.
       --user=USER               set both ftp and http user to USER.
       --password=PASS           set both ftp and http password to PASS.
       --ask-password            prompt for passwords.
       --no-iri                  turn off IRI support.
       --local-encoding=ENC      use ENC as the local encoding for IRIs.
       --remote-encoding=ENC     use ENC as the default remote encoding.
       --unlink                  remove file before clobber.       
  2. Follow "how to wait wget finished" to get more resources (see the sketch below).
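For example, a sketch that combines -i, -nc and -q from the option lists above with the asker's split files, and uses the shell's wait built-in so the script only exits once every download has finished:

#!/bin/bash
# one quiet wget per split file, each reading its URLs with -i
for f in ./temp/split*; do
    wget -nc -q -P ./images/ -i "$f" &
done
# wait for all background wget jobs before exiting
wait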
