
Download multiple files fast with wget

I want to download several webpages using wget, and for that I'm using the following Bash script:

wget -x --load-cookies cookies.txt http://www.example.com/1
wget -x --load-cookies cookies.txt http://www.example.com/2
wget -x --load-cookies cookies.txt http://www.example.com/3
wget -x --load-cookies cookies.txt http://www.example.com/4
wget -x --load-cookies cookies.txt http://www.example.com/5
wget -x --load-cookies cookies.txt http://www.example.com/6
wget -x --load-cookies cookies.txt http://www.example.com/7
wget -x --load-cookies cookies.txt http://www.example.com/8

And I run it under Cygwin with:

sh download.sh

However, each time it downloads a file it reconnects to the server, and that takes time. Is there a more efficient way to mass-download files from the same server (example.com/...)?

You could try mget. It's basically a multithreaded wget.

I agree with some of the previous answers that suggest opening new processes so the commands run in parallel. That said, whenever I do stuff like this, I use an extremely handy tool that also works under Cygwin: GNU Parallel.
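(Before getting to parallel: for reference, the plain "open new processes" approach those answers describe is just backgrounding each wget and waiting for them all; a minimal sketch, using only the cookies.txt and URLs from the question:

for i in {1..8}; do
  wget -x --load-cookies cookies.txt "http://www.example.com/$i" &   # launch in background
done
wait   # block until every background wget has finished

That works, but parallel gives you much finer control over how many run at once.)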

After installing parallel, I would run the following for your example:

$ for i in {1..8}; do echo $i; done | parallel -j+0 wget -x --load-cookies cookies.txt http://www.example.com/{}
  • The for loop is just feeding the different parameters line by line into parallel. There are multiple ways you can do this (a couple of equivalent alternatives are shown right after this list); this is just one example.
  • -j+0 tells parallel to run as many jobs at a time as you have CPU cores. man parallel explains the other options; it's extremely tweakable, so have a look and adjust it to your needs.
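As mentioned in the first bullet, the for loop is only one way to feed the arguments in; these variants should be equivalent (still assuming the cookies.txt and URL pattern from the question):

$ seq 1 8 | parallel -j+0 wget -x --load-cookies cookies.txt http://www.example.com/{}
$ parallel -j+0 wget -x --load-cookies cookies.txt http://www.example.com/{} ::: {1..8}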

So basically, if you have 4 cores, and run the top command, you will see 4 separate wget processes running simultaneously. As soon as one exits, another one starts until all 8 jobs have finished.

Since we are mainly limited by network connections rather than CPU time, the other solutions may work better, but this is one easy way of accomplishing what you are attempting. And like I said, parallel is extremely feature-rich, so you can probably tweak that command to make it even better/faster.
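For example, since these jobs mostly wait on the network rather than the CPU, one tweak worth trying is an explicit job count instead of -j+0, so all eight downloads run at once regardless of core count; 8 here is just a guess at a starting point, not a tuned value:

$ for i in {1..8}; do echo $i; done | parallel -j 8 wget -x --load-cookies cookies.txt http://www.example.com/{}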

It's definitely worth experimenting with. For instance, I'm not exactly sure what would happen if you split it into two parallel invocations, which might be the perfect answer on a 4-core system:

$ for i in {1..4}; do echo $i; done | parallel -j+0 wget -x --load-cookies cookies.txt http://www.example.com/{}
$ for i in {5..8}; do echo $i; done | parallel -j+0 wget -x --load-cookies cookies.txt http://www.example.com/{}

You'd still have to run these commands in subshells so that they don't execute sequentially (using (...)& and whatnot, as some others suggested). Someone please correct me if I'm wrong, but it would probably look something like this:

$ (for i in {1..4}; do echo $i; done | parallel -j+0 wget -x --load-cookies cookies.txt http://www.example.com/{})&
$ (for i in {5..8}; do echo $i; done | parallel -j+0 wget -x --load-cookies cookies.txt http://www.example.com/{})&
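If you put those two lines into a script, you'd probably also want a wait at the end so the script doesn't exit before the backgrounded subshells are done; roughly (untested):

(for i in {1..4}; do echo $i; done | parallel -j+0 wget -x --load-cookies cookies.txt http://www.example.com/{}) &
(for i in {5..8}; do echo $i; done | parallel -j+0 wget -x --load-cookies cookies.txt http://www.example.com/{}) &
wait   # don't continue until both subshells have finished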

The pseudo output from top would probably look something like this:

wget
wget
wget
wget
parallel
wget
wget
wget
wget
parallel

All that being said, I've never used mget, which may actually be the correct tool for the job. The response regarding Aria2 was a little off, but they were correct in stating that it is a command-line download tool capable of multithreaded downloading.

Not with wget. wget is sequential, meaning it starts a file, downloads it in parts until it's done, and disconnects. There's no way here to download all the files over the same connection. You could use something like aria2c to do this, but I'm not sure how much of an improvement you'd get.
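For what it's worth, a rough aria2c equivalent might look like the following. The flags are standard aria2c options (-i reads a list of URLs from a file, -j sets the number of concurrent downloads, -x the connections per server), but check man aria2c for your version; urls.txt is just a file name made up for this sketch:

printf 'http://www.example.com/%s\n' {1..8} > urls.txt
aria2c --load-cookies=cookies.txt -i urls.txt -j 8 -x 4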
