
Bash _ wget _html2txt

I was asked to use wget to download multiple URLs saved in a file and store them in another folder, so I used this command:

wget -E -i url.txt -P ~/Desktop/ProjectM2/data/crawl

Problem number 1: the files have to be named as follows:

1.html
2.html
3.html
..

I tried many things and I still can't do it.

Problem number 2: I don't know how to convert all these files from .html to .txt in one command using html2txt -utf8, while also keeping the numbers:

1.txt
2.txt
3.txt
..

Thank you.

If the order of the URLs in url.txt matters, that is, 1.html should contain the data of the first URL, 2.html should correspond to the second URL, and so on, then you can process the URLs one by one.

The following script takes the desired action for each URL:

#!/bin/bash

infile="$1"

# use $HOME here: bash does not expand ~ inside double quotes
dest_dir="$HOME/Desktop/ProjectM2/data/crawl"

# create html and txt dir inside dest_dir
mkdir -p "$dest_dir"/{html,txt}

c=1
while IFS='' read -r url || [[ -n "$url" ]]; do

    echo "Fetch $url into $c.html"
    wget -q -O "$dest_dir/html/$c.html" "$url"

    echo "Convert $c.html to $c.txt"
    html2text -o "$dest_dir/txt/$c.txt" "$dest_dir/html/$c.html"

    c=$(( c + 1 ))

done < "$infile"

The script takes an input file as its first argument, in this case url.txt. It creates two directories (html, txt) under your destination directory ~/Desktop/ProjectM2/data/crawl in order to better organize the resulting files. We read the URLs from url.txt line by line with the help of a while loop (see "Read file line by line"). With wget you can specify the desired output filename with the -O option, so you can name each file as you wish, in your case with a sequence number. The -q option suppresses wget's messages on the command line. In html2text you can specify the output file using -o.
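To check the numbering before downloading anything, a dry run can help. The following is only a sketch: plan_targets is a hypothetical helper name (not part of the script above), and it only prints which .html and .txt name each URL would get, without calling wget or html2text.

```shell
# Dry run: show the file names each URL would be mapped to.
# plan_targets is a hypothetical helper name used for illustration only.
plan_targets() {
    local infile="$1" url html c=1
    # "|| [ -n "$url" ]" keeps a last line that has no trailing newline
    while IFS='' read -r url || [ -n "$url" ]; do
        html="$c.html"
        # ${html%.html} strips the suffix, so the number is kept for the .txt name
        printf '%s -> %s -> %s\n' "$url" "$html" "${html%.html}.txt"
        c=$(( c + 1 ))
    done < "$infile"
}
```

Calling plan_targets url.txt prints one line per URL in the form URL -> N.html -> N.txt, in the same order as the input file. The same ${name%.html}.txt expansion could be reused to derive the .txt name for files that were already downloaded.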
