
Creating a wget Bash Script

I'm creating a wget script to download and mirror a site. The URLs are taken from a text file. I have nearly finished the script, but now I need to make it robust. It will run for 3 hours every day, so it should continue from where it last left off.
I have provided my script below; anyone who finds it useful may use it, but please keep my name in the script.

Problems with the script:

The script is not converting its links correctly so that they reference the downloaded files in the parent directory; please advise on that.
The script is not resuming after being aborted in the middle, even with the --continue parameter.

#       Created by Salik Sadruddin Merani
#       email: ssm14293@gmail.com
#       site: http://www.dragotech-innovations.tk
clear
echo '  Created by: Salik Sadruddin Merani'
echo '  email: ssm14293@gmail.com'
echo '  site: http://www.dragotech-innovations.tk'
echo
echo '  Info:'
echo '  This script will use the URLs provided in the File "urls.txt"'
echo '  Info: Logs will be saved in logfile.txt'
echo '  URLs are taken from the urls.txt file'
#
# Read the comma-separated URL list from urls.txt
url=$(< ./urls.txt)
useragent='Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'
echo '  Mozilla Firefox User agent will be used'

# NOTE: form-field values with spaces (e.g. "Log In") should be URL-encoded in POST data
cred='log=abc@123.org&pwd=abc123&wp-submit=Log+In&redirect_to=http://abc@123.org/wp-admin/&testcookie=1'
echo '  Loaded Credentials'
echo '  Logging In'
wget --save-cookies cookies.txt --post-data "${cred}" --keep-session-cookies http://members.ebenpagan.com/wp-login.php --delete-after

OIFS=$IFS
IFS=','
arr2=$url
for x in $arr2
do
    echo '      Loading Cookies'
    # --spider removed: it only checks links and downloads nothing.
    # robots=off (not robots=no) is the value wget recognises;
    # -np dropped as it duplicates --no-parent; slashes in the URL
    # are replaced so the log file name is valid.
    wget --load-cookies cookies.txt --keep-session-cookies --mirror --convert-links --page-requisites "${x}" -U "${useragent}" --adjust-extension --continue -e robots=off --span-hosts --no-parent -o "log-file-${x//\//_}.txt"
done
IFS=$OIFS

Regards

The --continue flag in wget only attempts to resume the download of a single, partially-downloaded file in the current directory. See the wget man page for more info; it is quite detailed.

What you need is to resume the mirroring/downloading from where the script previously left off.

So it's more a modification of the script than some setting in wget. I can suggest one way to do that, but mind you, a different approach could work as well.

Modify the urls.txt file to have one URL per line, then follow this pseudocode -

  1. get the url from the file
  2. if (url ends with a token #DONE), continue
  3. else, wget command
  4. append a token #DONE to the end of the url in the file

This way, you will know which URL to continue from, the next time you run the script. All URLs that have a "#DONE" at the end will be skipped, and the rest will be downloaded.
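The pseudocode above can be sketched in bash roughly as follows. This is a minimal sketch, not a tested drop-in: the file name urls.txt, the function name `mirror_pending_urls`, the exact "#DONE" marker, and the particular wget options are assumptions based on the answer.

```shell
#!/usr/bin/env bash
# Sketch of the resume-aware loop: skip URLs already marked "#DONE",
# download the rest, and mark each one only after wget succeeds.

URLFILE="urls.txt"   # assumed: one URL per line

mirror_pending_urls() {
    local tmp
    tmp=$(mktemp)
    while IFS= read -r line; do
        if [[ "$line" == *"#DONE" ]]; then
            # Finished on a previous run: keep the line, skip the download.
            printf '%s\n' "$line" >> "$tmp"
            continue
        fi
        if wget --mirror --page-requisites --convert-links \
                --continue -e robots=off "$line"; then
            # Mark the URL as done so the next run skips it.
            printf '%s #DONE\n' "$line" >> "$tmp"
        else
            # Leave the URL unmarked so it is retried next time.
            printf '%s\n' "$line" >> "$tmp"
        fi
    done < "$URLFILE"
    mv "$tmp" "$URLFILE"
}
```

Because a URL is only marked after wget exits successfully, a run that is aborted mid-download leaves that URL unmarked, and the next run retries it (with --continue picking up any partially-downloaded files).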

