
Get specific files from HDFS while keeping the directory structure

I have a directory structure that looks like this on my HDFS system:

/some/path
  ├─ report01
  │    ├─ file01.csv
  │    ├─ file02.csv
  │    ├─ file03.csv
  │    └─ lot_of_other_non_csv_files
  ├─ report02
  │    ├─ file01.csv
  │    ├─ file02.csv
  │    ├─ file03.csv
  │    ├─ file04.csv
  │    └─ lot_of_other_non_csv_files
  └─ report03
       ├─ file01.csv
       ├─ file02.csv
       └─ lot_of_other_non_csv_files

I would like to copy all CSV files to my local system while keeping the directory structure.

I tried hdfs dfs -copyToLocal /some/path/report*, but that copies a lot of unnecessary (and quite large) files that I don't want.

I also tried hdfs dfs -copyToLocal /some/path/report*/file*.csv, but this does not preserve the directory structure, and HDFS complains that the files already exist when it tries to copy the files from the report02 folder.

Is there a way to get only the files matching a specific pattern while still keeping the original directory structure?

As there doesn't seem to be any solution implemented directly in Hadoop, I finally ended up creating my own bash script:

#!/bin/bash

# extended-regex patterns (for egrep) of files to get
TO_GET=("\.csv$" "\.png$")
# extended-regex patterns of files/directories to avoid
TO_AVOID=("_temporary")

# function to join an array by a specified separator:
# usage: join_arr ";" ${array[@]}
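# e.g. join_arr "|" "\.csv$" "\.png$" prints "\.csv$|\.png$", the
# alternation pattern expected by egrep below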
join_arr() {
  local IFS="$1"
  shift
  echo "$*"
}

if (($# != 2))
then
    echo "There should be two parameters (path of the directory to get and destination)."
else
    # ensure that the provided path ends with a slash
    [[ "$1" != */ ]] && path="$1/" || path="$1"
    echo "Path to copy: $path"
    # ensure that the provided destination ends with a slash and append result directory name
    [[ "$2" != */ ]] && dest="$2/" || dest="$2"
    dest="$dest$(basename $path)/"
    echo "Destination: $dest"
    # get name of all files matching the patterns
    echo -n "Exploring path to find matching files... "
    readarray -t files < <(hdfs dfs -ls -R "$path" | egrep -v "$(join_arr "|" "${TO_AVOID[@]}")" | egrep "$(join_arr "|" "${TO_GET[@]}")" | awk '{print $NF}' | cut -c $((${#path}+1))-)
    echo "Done!"
    # check if at least one file found
    [ -z "$files" ]  && echo "No files matching the patern."
    # get files one by one
    for file in "${files[@]}"
    do
        path_and_file="$path$file"
        dest_and_file="$dest$file"
        # make sure the directory exist on the local file system
        mkdir -p "$(dirname "$dest_and_file")"
        # get file in a separate process to be able to execute the queries in parallel
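        # (note: each copy spawns its own hdfs client process, so for very
        # large numbers of files you may want to limit how many run at once)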
        (hdfs dfs -copyToLocal -f "$path_and_file" "$dest_and_file" && echo "$file") &
    done
    # wait for all queries to be finished
    wait
fi

You can call the script like this:

$ script.sh "/some/hdfs/path/folder_to_get" "/some/local/path"
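
(This assumes the script was saved as script.sh and made executable, e.g. with chmod +x script.sh.)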

The script will create a directory folder_to_get in /some/local/path containing all CSV and PNG files, preserving the directory structure.
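
For instance, applied to the example tree from the question, $ script.sh "/some/path" "/some/local/dest" should produce a local tree like this (only the CSV files are copied, as the example contains no PNG files; /some/local/dest is a placeholder destination):

/some/local/dest
  └─ path
       ├─ report01
       │    ├─ file01.csv
       │    ├─ file02.csv
       │    └─ file03.csv
       ├─ report02
       │    ├─ file01.csv
       │    ├─ file02.csv
       │    ├─ file03.csv
       │    └─ file04.csv
       └─ report03
            ├─ file01.csv
            └─ file02.csv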

Note: If you want to get files other than CSV and PNG, just modify the TO_GET variable at the top of the script. You can also modify the TO_AVOID variable to filter out directories that you don't want to scan, even if they contain CSV or PNG files.
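
For example, to also fetch Parquet files and to additionally skip a hypothetical .staging directory, the two arrays could be changed like this (the patterns are extended regular expressions, as consumed by egrep in the script):

TO_GET=("\.csv$" "\.png$" "\.parquet$")
TO_AVOID=("_temporary" "\.staging")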
