Get specific files while keeping directory structure from HDFS
I have a directory structure like this on my HDFS system:
/some/path
├─ report01
│  ├─ file01.csv
│  ├─ file02.csv
│  ├─ file03.csv
│  └─ lot_of_other_non_csv_files
├─ report02
│  ├─ file01.csv
│  ├─ file02.csv
│  ├─ file03.csv
│  ├─ file04.csv
│  └─ lot_of_other_non_csv_files
└─ report03
   ├─ file01.csv
   ├─ file02.csv
   └─ lot_of_other_non_csv_files
I would like to copy all CSV files to my local system while keeping the directory structure. I tried

hdfs dfs -copyToLocal /some/path/report*

but that also copies a lot of unnecessary (and quite large) files that I don't want to get.
I also tried

hdfs dfs -copyToLocal /some/path/report*/file*.csv

but this does not preserve the directory structure, and HDFS complains that the files already exist when it tries to copy the files from the folder report02.
Is there a way to get only files matching a specific pattern while still keeping the original directory structure?
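The flattening can be reproduced with plain cp on a local file system (a minimal sketch using temporary directories, not the real HDFS paths):

```shell
# Local analogue of the second attempt: copying through a glob flattens the tree.
src=$(mktemp -d)
dst=$(mktemp -d)
mkdir -p "$src/report01" "$src/report02"
echo "from report01" > "$src/report01/file01.csv"
echo "from report02" > "$src/report02/file01.csv"

# Both matches are copied directly into $dst; the report01/report02 parents
# are lost, so the second file silently overwrites the first (HDFS refuses
# to overwrite instead, hence the "already exist" complaint).
cp "$src"/report*/file*.csv "$dst"

ls "$dst"                 # only file01.csv
cat "$dst/file01.csv"     # prints: from report02
```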
As it seems that there isn't any solution directly implemented in Hadoop, I finally ended up creating my own bash script:
#!/bin/bash

# patterns (extended regexes) of files to get
TO_GET=("*.csv$" "*.png$")
# patterns of files/directories to avoid
TO_AVOID=("*_temporary*")

# function to join an array by a specified separator
# usage: join_arr ";" "${array[@]}"
join_arr() {
    local IFS="$1"
    shift
    echo "$*"
}

if (($# != 2))
then
    echo "There should be two parameters (path of the directory to get and destination)."
else
    # ensure that the provided path ends with a slash
    [[ "$1" != */ ]] && path="$1/" || path="$1"
    echo "Path to copy: $path"
    # ensure that the provided destination ends with a slash and append the result directory name
    [[ "$2" != */ ]] && dest="$2/" || dest="$2"
    dest="$dest$(basename "$path")/"
    echo "Destination: $dest"
    # get the names of all files matching the patterns
    echo -n "Exploring path to find matching files... "
    readarray -t files < <(hdfs dfs -ls -R "$path" | egrep -v "$(join_arr "|" "${TO_AVOID[@]}")" | egrep "$(join_arr "|" "${TO_GET[@]}")" | awk '{print $NF}' | cut -c $((${#path}+1))-)
    echo "Done!"
    # check that at least one file was found
    ((${#files[@]} == 0)) && echo "No files matching the pattern."
    # get the files one by one
    for file in "${files[@]}"
    do
        path_and_file="$path$file"
        dest_and_file="$dest$file"
        # make sure the directory exists on the local file system
        mkdir -p "$(dirname "$dest_and_file")"
        # fetch each file in a separate process so the copies run in parallel
        (hdfs dfs -copyToLocal -f "$path_and_file" "$dest_and_file" && echo "$file") &
    done
    # wait for all copies to finish
    wait
fi
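The join_arr helper can be checked on its own; it builds the alternation string that is passed to egrep:

```shell
# join_arr joins its remaining arguments with the separator given as the
# first argument (IFS is set locally so it only affects the "$*" expansion).
join_arr() {
    local IFS="$1"
    shift
    echo "$*"
}

patterns=("*.csv$" "*.png$")
join_arr "|" "${patterns[@]}"   # prints: *.csv$|*.png$
```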
You can call the script like this:
$ script.sh "/some/hdfs/path/folder_to_get" "/some/local/path"
The script will create a directory folder_to_get in /some/local/path containing all the CSV and PNG files, respecting the directory structure.
Note: If you want to get files other than CSV and PNG, just modify the TO_GET variable at the top of the script. You can also modify the TO_AVOID variable to filter out directories that you don't want to scan, even if they contain CSV or PNG files.
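The filtering pipeline at the heart of the script can be tried without a cluster by mocking the `hdfs dfs -ls -R` listing. This sketch uses illustrative paths and writes the patterns as plain extended regexes (`\.csv$` rather than the glob-style spellings above), since that form is unambiguous for egrep:

```shell
# Mocked version of the script's listing-and-filtering pipeline.
path="/some/path/"
TO_GET=("\.csv$" "\.png$")       # regex-safe spellings (assumption, see lead-in)
TO_AVOID=("_temporary")

join_arr() { local IFS="$1"; shift; echo "$*"; }

mock_ls() {
    # mimic `hdfs dfs -ls -R` output: the file path is the last column
    printf '%s\n' \
        "-rw-r--r--   1 user group  10 2020-01-01 00:00 /some/path/report01/file01.csv" \
        "-rw-r--r--   1 user group  10 2020-01-01 00:00 /some/path/report01/big.parquet" \
        "-rw-r--r--   1 user group  10 2020-01-01 00:00 /some/path/_temporary/file02.csv"
}

# drop avoided paths, keep wanted extensions, take the path column,
# and strip the leading "$path" so only the relative part remains
readarray -t files < <(mock_ls \
    | egrep -v "$(join_arr "|" "${TO_AVOID[@]}")" \
    | egrep "$(join_arr "|" "${TO_GET[@]}")" \
    | awk '{print $NF}' \
    | cut -c $((${#path}+1))-)

printf '%s\n' "${files[@]}"   # prints: report01/file01.csv
```

The relative paths produced here are exactly what the script appends to both $path and $dest, which is how the directory structure is preserved on the local side.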