Download from list of URLs and output to relative directories

I have a list of URLs in a text file:

http://host/index.html
http://host/js/test.js
http://host/js/sub/test_sub.js
http://host/css/test.css

I would like to download these files, replicating the same tree on my filesystem. For example, I would like to end up with the following tree when I'm done:

wd/
 |_index.html
 |_js/
 |  |_test.js
 |  |_sub/
 |     |_test_sub.js
 |_css/
    |_test.css

Here's what I've tried:

Add the target file as a second field on each line of the list:

http://host/index.html 
http://host/js/test.js js/test.js
http://host/js/sub/test_sub.js js/sub/test_sub.js
http://host/css/test.css css/test.css

Use a while loop to tell wget where to save these:

 while read url target; do
   wget "$url" -P "$target";
 done < site_media_list.txt 

This didn't work: the end result was all files in the same directory, without any new directories.

Make a file containing only the links (no target paths), one per line. Then wget -nH -x -i links_list.txt downloads the files into the working directory while keeping the directory structure intact. A more readable version of the same command is given below.

wget --no-host-directories --force-directories --input-file=links_list.txt

Wget has many flexible options for directories. See the "Directory Options" section of man wget for more.
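To check where each URL should land before downloading anything, the mapping applied by -nH -x can be sketched in plain shell. The helper name preview_path and its optional prefix argument (mirroring wget's -P) are illustrative assumptions, and the sketch assumes simple scheme://host/path URLs with no query string:

```shell
#!/bin/bash
# Hypothetical helper: print the local path expected for a URL under
# `wget -nH -x -P PREFIX`. Assumes scheme://host/path with no query string.
preview_path() {
    local url=$1 prefix=${2:-.}
    local rest=${url#*://}    # drop the scheme, e.g. "http://"
    local path=${rest#*/}     # drop the host component
    printf '%s/%s\n' "$prefix" "$path"
}

preview_path http://host/js/sub/test_sub.js wd   # prints wd/js/sub/test_sub.js
```

Running it over every line of links_list.txt gives a quick preview of the tree that the wget command is expected to create.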

Assuming your file site_media_list.txt contains only the list of URLs (and not target directories), you should be able to parse the directory names out of each URL:

while read -r url ; do
  # Capture the directory part of the path (empty for top-level files).
  s=$(echo "$url" | sed -E 's#http://host/(.*/)?.*$#\1#')
  if [[ -z "$s" ]]; then
    echo "working dir"
    wget "$url"
  else
    echo "subdir"
    mkdir -p "$s"
    wget "$url" -P "$s"
  fi
done < site_media_list.txt

It looks like the main problem you were having is that you were passing both the directory name and the filename to wget. You only need to pass the directory name; wget will work out the filename from the URL.
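If you prefer to keep the two-column list from the question, a minimal sketch of the same fix is to hand wget only the directory part of each target. The function names target_dir and download_list are made up for illustration, and the list is assumed to have one "URL [target]" pair per line:

```shell
#!/bin/bash
# Hypothetical helper: directory part of a target path.
# An empty target means the current working directory.
target_dir() {
    if [ -z "$1" ]; then
        echo .
    else
        dirname "$1"
    fi
}

# Assumes each line of the list file is "URL [target]".
download_list() {
    local url target dir
    while read -r url target; do
        [ -z "$url" ] && continue      # skip blank lines
        dir=$(target_dir "$target")
        mkdir -p "$dir"
        wget "$url" -P "$dir"          # -P takes a directory, not a file name
    done < "$1"
}

# Usage (performs real downloads):
# download_list site_media_list.txt
```

The file name itself still comes from the URL, so the second column only decides which directory each file goes into.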

Split the URL on / into an array, then use only the relevant elements to create the path.

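#!/bin/bash
# Reads URLs from standard input, one per line.
while read -r url ; do
    # Split on "/": http://host/js/test.js -> "http:" "" "host" "js" "test.js"
    IFS=/ parts=($url)
    if (( ${#parts[@]} > 4 )) ; then
        # Join the directory components between the host and the file name.
        IFS=/ path="${parts[*]:3:${#parts[@]}-4}"
        mkdir -p "$path"
    fi
    # Save under the path relative to the host, e.g. js/test.js.
    IFS=/ wget -O "${parts[*]:3}" "$url"
done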
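One thing to be aware of: a bare assignment such as IFS=/ parts=($url) changes IFS for the rest of the script. If that matters, a variant that keeps the change local might look like the sketch below. The function names url_relpath and fetch_all are hypothetical, and URLs are assumed to have the form scheme://host/path:

```shell
#!/bin/bash
# Hypothetical helper: print the host-relative path of a URL, splitting
# on "/" with IFS scoped to this function only.
url_relpath() {
    local -a parts
    IFS=/ read -r -a parts <<< "$1"   # IFS=/ applies to this read only
    local IFS=/                       # restored when the function returns
    printf '%s\n' "${parts[*]:3}"     # join everything after the host
}

# Sketch of the download loop; reads URLs from standard input.
fetch_all() {
    local url rel
    while read -r url; do
        rel=$(url_relpath "$url")
        mkdir -p "$(dirname "$rel")"  # "." for top-level files
        wget -O "$rel" "$url"
    done
}

# Usage (performs real downloads):
# fetch_all < links_list.txt
```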
