
Parallel download for a list of URLs and renaming

I have a list of tab-separated target file names and URLs in a file, urls_to_download.txt, for example:

first_file.jpg\thttps://www.google.co.il/images/srpr/logo11w.png
/subdir_1/second_file.jpg\thttps://www.google.co.il/images/srpr/logo12w.png
...
last_file.jpg\thttps://www.google.co.il/images/srpr/logo99w.png

which I want to download using several connections.

I can do this, for example, with:

cat urls_to_download.txt | xargs -n 1 -P 10 wget -nc

My question is: how do I save the files under the new names I want, so that the output directory contains:

first_file.jpg
/subdir_1/second_file.jpg
...
last_file.jpg

I am guessing that something like this should work for you:

#!/bin/bash
# Read "FILENAME<TAB>URL" pairs and download them one at a time.
while read -r FILENAME URL; do
  wget -nc -O "$FILENAME" "$URL"
done < input.txt

where input.txt is a file that contains tab-separated filename/URL pairs, one per line.

  1. Note that the file names in your file use absolute paths, so you should first rewrite them as relative paths (see the sketch after this list).

  2. In shell, simply appending & to a command runs it in the background; that is what makes the work parallel.
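
Regarding point 1, here is a minimal sketch of the same loop that strips the leading slash and creates any missing subdirectories before downloading (setting IFS to a tab also keeps file names containing spaces intact):

#!/bin/bash
# Sketch: split each line strictly on the tab so file names with spaces
# survive, strip a leading "/" so the path stays relative, and create
# the target subdirectory before downloading.
while IFS=$'\t' read -r FILENAME URL; do
    FILENAME="${FILENAME#/}"               # "/subdir_1/x.jpg" -> "subdir_1/x.jpg"
    mkdir -p "$(dirname "$FILENAME")"      # make sure the subdirectory exists
    wget -nc -O "$FILENAME" "$URL"
done < input.txt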

For example, to make the downloads parallel as in point 2, you can do something like this:

#!/bin/bash
while read -r FILENAME URL
do
    wget -nc -O "./$FILENAME" "$URL" &   # `wget` runs in the background
done < input.txt

NOTE: The above script is just a hint and will create too many parallel wget processes if input.txt has a lot of lines. There are ways to control the number of parallel tasks, though they are all more or less awkward to express in a shell script.

Here is a very simple way to control the number of parallel tasks, ensuring that at most 20 wget processes run at a time:

#!/bin/bash
NUMBER=0
while read -r FILENAME URL
do
    wget -nc -O "./$FILENAME" "$URL" &   # `wget` runs in the background
    NUMBER=$((NUMBER + 1))
    if [ "$NUMBER" -ge 20 ]
    then
        wait   # wait for all background processes to finish
        NUMBER=0
    fi
done < input.txt
wait

However, this method is crude: it waits for a whole batch of 20 to finish before starting the next one, so it is neither the most efficient nor the most precise way to control the number of parallel tasks.
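
If you want to keep a fixed pool of workers busy instead of waiting batch-by-batch, one sketch (assuming GNU xargs, whose -d and -P options are used here) hands each line to a small bash -c worker:

# Sketch (assumes GNU xargs): xargs keeps exactly 10 wget processes
# busy. Each line "FILENAME<TAB>URL" is passed to the bash -c worker
# as $0; the default IFS splits it on the tab (file names must not
# contain spaces for this to work).
xargs -d '\n' -n 1 -P 10 bash -c '
    read -r filename url <<< "$0"
    mkdir -p "$(dirname "./$filename")"   # create target subdirectories
    wget -nc -O "./$filename" "$url"
' < input.txt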

Try these commands to download your files and then rename them:

`cut -f 2 urls_to_download.txt | wget -i -;` 

`paste <(cut -f 2 urls_to_download.txt | sed 's/.*\///') <(cut -f 1 urls_to_download.txt) | while read -r from to; do mv "$from" "$to"; done`

I can't find a way to rename the files directly with a wget option, and you will need to make sure the target directory exists before the mv command runs.
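
A minimal sketch folding that directory creation into the rename step (it assumes, as above, that each downloaded file is named after the last path component of its URL):

#!/bin/bash
# Sketch: rename each downloaded file (named after the URL's last path
# component) to the target name from column 1 of the list.
while IFS=$'\t' read -r name url; do
    name="${name#/}"                # keep target paths relative
    mkdir -p "$(dirname "$name")"   # make sure the target directory exists
    mv "${url##*/}" "$name"         # strip everything up to the last "/"
done < urls_to_download.txt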

Simply use wget's -x option:

-x
--force-directories
    The opposite of -nd: create a hierarchy of directories, even if one
    would not have been created otherwise. E.g., wget -x
    http://fly.srk.fer.hr/robots.txt will save the downloaded file to
    fly.srk.fer.hr/robots.txt.

xargs -n 1 -P 10 wget -nc -x < urls_to_download.txt

If your file is tab-delimited, extract the URL column first (xargs -d accepts only a single delimiter, so splitting on tabs alone would leave the newlines embedded inside the arguments):

cut -f 2 urls_to_download.txt | xargs -n 1 -P 10 wget -nc -x

Or perhaps you can convert tabs to newlines (note that the column-1 file names will then also be passed to wget and simply fail as invalid URLs):

sed -e 's|\t|\n|g' urls_to_download.txt | xargs -n 1 -P 10 wget -nc -x
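
Note that with -x the saved paths are derived from the URLs, not from the target names in column 1. For example, with the first URL from the question:

wget -nc -x https://www.google.co.il/images/srpr/logo11w.png
# saves to: ./www.google.co.il/images/srpr/logo11w.png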
