
Bash: wget and html2txt

I was asked to use wget to download multiple URLs listed in a file and store them in another folder, so I used this command:

wget -E -i url.txt -P ~/Desktop/ProjectM2/data/crawl

Problem 1: the files have to be named as follows:

1.html
2.html
3.html
..

I have tried many things and still can't get it to work.

Problem 2: I don't know how to convert all of these files from .html to .txt in one command using html2txt -utf8 while keeping the numbers:

1.txt
2.txt
3.txt
..

thank you

If the order of the URLs in url.txt matters, that is, 1.html should contain the data of the first URL, 2.html should correspond to the second URL, and so on, then you can process the URLs one by one.

The following script takes the desired action for each url:

#!/bin/bash

infile="$1"

# use $HOME here: a ~ inside double quotes is not expanded by the shell
dest_dir="$HOME/Desktop/ProjectM2/data/crawl"

# create html and txt dir inside dest_dir
mkdir -p "$dest_dir"/{html,txt}

c=1
while IFS='' read -r url || [[ -n "$url" ]]; do

    echo "Fetch $url into $c.html"
    wget -q -O "$dest_dir"/html/$c.html "$url"

    echo "Convert $c.html to $c.txt"
    html2text -o "$dest_dir"/txt/$c.txt "$dest_dir"/html/$c.html

    c=$(( c + 1 ))

done < "$infile"

The script takes the input file as its first argument, in this case url.txt. It creates two directories (html and txt) under the destination directory ~/Desktop/ProjectM2/data/crawl to keep the resulting files organized. A while loop reads the URLs from url.txt line by line; the || [[ -n "$url" ]] test makes sure the last line is still processed when the file does not end with a newline. With wget, the -O option sets the output filename, so each file can be named as you wish, in your case with a sequence number. The -q option suppresses wget's messages on the command line. With html2text, the -o option sets the output file.
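If the .html files have already been downloaded and only the conversion step from Problem 2 is needed, a loop over the existing files may be enough. The following is a minimal sketch, not a definitive solution: the function name convert_all is made up for illustration, and it assumes your html2text build accepts the -utf8 flag mentioned in the question alongside -o.

```shell
#!/bin/bash

# Hypothetical helper: convert every N.html in src_dir to N.txt in out_dir.
# Assumes html2text accepts -utf8 (from the question) and -o (from the script above).
convert_all() {
    local src_dir="$1" out_dir="$2" f base
    mkdir -p "$out_dir"
    for f in "$src_dir"/*.html; do
        [ -e "$f" ] || continue            # no .html files: the glob stays literal, so skip it
        base="$(basename "$f" .html)"      # 1.html -> 1, so the number is kept
        html2text -utf8 -o "$out_dir/$base.txt" "$f"
    done
}

# Example call, using the directories created by the script above:
# convert_all "$HOME/Desktop/ProjectM2/data/crawl/html" "$HOME/Desktop/ProjectM2/data/crawl/txt"
```

Because the output name is derived from the input name, the numbering of the .txt files follows the .html files regardless of the order in which the loop visits them.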


 