
Parallel download for a list of URLs and renaming

I have a list of tab-separated target file names and URLs in a file, urls_to_download.txt, for example:

first_file.jpg\thttps://www.google.co.il/images/srpr/logo11w.png
/subdir_1/second_file.jpg\thttps://www.google.co.il/images/srpr/logo12w.png
...
last_file.jpg\thttps://www.google.co.il/images/srpr/logo99w.png

which I want to download using several connections.

I can do this, for example, with:

cat urls_to_download.txt | xargs -n 1 -P 10 wget -nc

My question is: how do I save the files under the new names I want, so that the output directory contains:

first_file.jpg
/subdir_1/second_file.jpg
...
last_file.jpg

I am guessing that something like this should work for you:

#!/bin/bash
# Read "FILENAME<TAB>URL" pairs and download them one at a time.
while read -r FILENAME URL; do
  wget -nc -O "$FILENAME" "$URL"
done < input.txt

where input.txt is a file that contains tab-separated filename/URL pairs, one per line.

  1. Note that the file names in your file use absolute paths, so you should first rewrite them as relative paths (see the sketch after this list).

  2. In shell, simply appending & to a command runs it in the background; that is what makes the work parallel.
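
Regarding point 1, here is a minimal sketch of the same loop that strips the leading slash and creates any missing subdirectories before downloading (setting IFS to a tab also keeps file names containing spaces intact):

#!/bin/bash
# Sketch: split each line strictly on the tab so file names with spaces
# survive, strip a leading "/" so the path stays relative, and create
# the target subdirectory before downloading.
while IFS=$'\t' read -r FILENAME URL; do
    FILENAME="${FILENAME#/}"               # "/subdir_1/x.jpg" -> "subdir_1/x.jpg"
    mkdir -p "$(dirname "$FILENAME")"      # make sure the subdirectory exists
    wget -nc -O "$FILENAME" "$URL"
done < input.txt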

For example, to make the downloads parallel as in point 2, you can do something like this:

#!/bin/bash
while read -r FILENAME URL
do
    wget -nc -O "./$FILENAME" "$URL" &   # `wget` runs in the background
done < input.txt

NOTE: The above script is just a hint and will create too many parallel wget processes if input.txt has a lot of lines. There are ways to control the number of parallel tasks, though they are all more or less awkward to express in a shell script.

Here is a very simple way to control the number of parallel tasks, ensuring that at most 20 wget processes run at a time:

#!/bin/bash
NUMBER=0
while read -r FILENAME URL
do
    wget -nc -O "./$FILENAME" "$URL" &   # `wget` runs in the background
    NUMBER=$((NUMBER + 1))
    if [ "$NUMBER" -ge 20 ]
    then
        wait   # wait for all background processes to finish
        NUMBER=0
    fi
done < input.txt
wait

However, this method is crude: it waits for a whole batch of 20 to finish before starting the next one, so it is neither the most efficient nor the most precise way to control the number of parallel tasks.
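
If you want to keep a fixed pool of workers busy instead of waiting batch-by-batch, one sketch (assuming GNU xargs, whose -d and -P options are used here) hands each line to a small bash -c worker:

# Sketch (assumes GNU xargs): xargs keeps exactly 10 wget processes
# busy. Each line "FILENAME<TAB>URL" is passed to the bash -c worker
# as $0; the default IFS splits it on the tab (file names must not
# contain spaces for this to work).
xargs -d '\n' -n 1 -P 10 bash -c '
    read -r filename url <<< "$0"
    mkdir -p "$(dirname "./$filename")"   # create target subdirectories
    wget -nc -O "./$filename" "$url"
' < input.txt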

Try these commands to download your files and then rename them:

`cut -f 2 urls_to_download.txt | wget -i -;` 

`paste <(cut -f 2 urls_to_download.txt | sed 's/.*\///') <(cut -f 1 urls_to_download.txt) | while read -r from to; do mv "$from" "$to"; done`

I can't find a way to rename the files directly with a wget option, and you will need to make sure the target directory exists before the mv command runs.
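
A minimal sketch folding that directory creation into the rename step (it assumes, as above, that each downloaded file is named after the last path component of its URL):

#!/bin/bash
# Sketch: rename each downloaded file (named after the URL's last path
# component) to the target name from column 1 of the list.
while IFS=$'\t' read -r name url; do
    name="${name#/}"                # keep target paths relative
    mkdir -p "$(dirname "$name")"   # make sure the target directory exists
    mv "${url##*/}" "$name"         # strip everything up to the last "/"
done < urls_to_download.txt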

Simply use wget's -x option:

-x
--force-directories
    The opposite of -nd: create a hierarchy of directories, even if one
    would not have been created otherwise. E.g., wget -x
    http://fly.srk.fer.hr/robots.txt will save the downloaded file to
    fly.srk.fer.hr/robots.txt.

xargs -n 1 -P 10 wget -nc -x < urls_to_download.txt

If your file is tab-delimited, extract the URL column first (xargs -d accepts only a single delimiter, so splitting on tabs alone would leave the newlines embedded inside the arguments):

cut -f 2 urls_to_download.txt | xargs -n 1 -P 10 wget -nc -x

Or perhaps you can convert tabs to newlines (note that the column-1 file names will then also be passed to wget and simply fail as invalid URLs):

sed -e 's|\t|\n|g' urls_to_download.txt | xargs -n 1 -P 10 wget -nc -x
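
Note that with -x the saved paths are derived from the URLs, not from the target names in column 1. For example, with the first URL from the question:

wget -nc -x https://www.google.co.il/images/srpr/logo11w.png
# saves to: ./www.google.co.il/images/srpr/logo11w.png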
