Bash: wget and html2txt
I was asked to use wget to download multiple URLs saved in a file and store them in another folder, so I used this command:
wget -E -i url.txt -P ~/Desktop/ProjectM2/data/crawl
Problem 1: the files have to be named as follows:
1.html
2.html
3.html
..
I tried many things and still can't get it to work.
Problem 2: I don't know how to convert all these files from .html to .txt in one command using html2txt -utf8, while keeping the numbers:
1.txt
2.txt
3.txt
..
Thank you.
If the order of the URLs in url.txt matters, that is, 1.html should contain the data of the first URL, 2.html should correspond to the second URL, and so on, then you can process the URLs one by one.
The following script performs the desired action for each URL:
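#!/bin/bash
# First argument: file containing one URL per line
infile="$1"
# Use $HOME rather than a quoted "~", which the shell does not expand
dest_dir="$HOME/Desktop/ProjectM2/data/crawl"
# create html and txt dir inside dest_dir
mkdir -p "$dest_dir"/{html,txt}
c=1
# The "|| [[ -n ... ]]" part also handles a last line without a trailing newline
while IFS='' read -r url || [[ -n "$url" ]]; do
    echo "Fetch $url into $c.html"
    wget -q -O "$dest_dir/html/$c.html" "$url"
    echo "Convert $c.html to $c.txt"
    html2text -o "$dest_dir/txt/$c.txt" "$dest_dir/html/$c.html"
    c=$(( c + 1 ))
done < "$infile"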
The script takes an input file as its first argument, in this case url.txt. It creates two directories (html, txt) under your destination directory ~/Desktop/ProjectM2/data/crawl in order to better organize the resulting files. We read the URLs from url.txt line by line with the help of a while loop (see "Read file line by line"). With wget you can specify the desired output filename with the -O option, so you can name each file as you wish, in your case with a sequence number. The -q option suppresses wget's messages on the command line. With html2text you can specify the output file using -o.
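If the .html files have already been downloaded, the conversion step on its own can be sketched as a small loop. This is only a minimal sketch and not part of the original answer: convert_all is a hypothetical helper name, and it assumes html2text accepts -o OUTPUT INPUT as in the script above. If your html2text build supports the -utf8 flag mentioned in the question, it could be added to the call.

```shell
#!/bin/bash
# Sketch: convert every N.html in one directory to N.txt in another.
# convert_all is a hypothetical helper name, not part of any tool.
# Assumes html2text accepts "-o OUTPUT INPUT", as in the script above.

convert_all() {
    local html_dir="$1" txt_dir="$2" f base
    mkdir -p "$txt_dir"
    for f in "$html_dir"/*.html; do
        # If no .html file exists, the glob stays literal; skip it
        [ -e "$f" ] || continue
        # Strip directory and extension: ".../3.html" -> "3"
        base="$(basename "$f" .html)"
        html2text -o "$txt_dir/$base.txt" "$f"
    done
}
```

It could then be called as convert_all "$HOME/Desktop/ProjectM2/data/crawl/html" "$HOME/Desktop/ProjectM2/data/crawl/txt". Because the output name is derived from the input name, each number is carried over from N.html to N.txt.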