
Copy files from local to HDFS in alphabetical order - Sort

I need to copy files from the local file system to HDFS through a shell script. Suppose I have two files in my local system:

fewInfo.tsv.gz
fewInfo.txt

In the above case, fewInfo.tsv.gz should be copied to HDFS first (s comes before x), and then fewInfo.txt should be copied. Is this possible?

Is anyone aware of how the "put" command works internally when multiple files are being copied to HDFS?

The Hadoop version I am using is Hadoop 2.5.0-cdh5.3.1.

You could loop through the directory to find all files, sort them, and then execute the HDFS copy. The advantage is that you can specify the constraints for the sort (e.g. by filename, date, order, etc.). There are many ways to do this; one is to use the find command:

find /some/directory -maxdepth 1 -type f | sort | while IFS= read -r filename; do hdfs dfs -copyFromLocal "$filename" hdfs://target/dir/; done
  • The -maxdepth 1 argument prevents find from recursively descending into any subdirectories. (If you want such nested directories to be processed, you can omit it.)
  • -type f specifies that only plain files will be processed.
  • sort puts the found files in order. Here you have the possibility to extend the sort, e.g. reverse order, sort by modification date, etc.
  • while IFS= read -r filename loops through the found files. IFS= in that loop preserves leading and trailing whitespace in the filenames. The -r option prevents read from treating backslash as a special character.
  • hdfs dfs -copyFromLocal "$filename" hdfs://target/dir/ takes the sorted filenames and copies each one from the local directory to the HDFS directory. Alternatively, you can also use hadoop fs -put "$filename" hdfs://target/dir/
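Putting the steps above together, a minimal runnable sketch of the pipeline could look like the following. The directory names are placeholders (not from the original post), and the actual hdfs call is left commented out so the ordering can be verified without a Hadoop installation:

```shell
#!/bin/sh
# Sketch: copy local files to HDFS in alphabetical order.
# SRC_DIR and HDFS_DIR are hypothetical placeholder paths.
SRC_DIR=/tmp/sortdemo
HDFS_DIR=hdfs://target/dir/

# Demo setup matching the two files from the question.
mkdir -p "$SRC_DIR"
touch "$SRC_DIR/fewInfo.tsv.gz" "$SRC_DIR/fewInfo.txt"

# find lists the plain files, sort orders them alphabetically, and the
# loop processes each one in turn in that order.
find "$SRC_DIR" -maxdepth 1 -type f | sort | while IFS= read -r filename; do
    echo "would copy: $filename"
    # hdfs dfs -copyFromLocal "$filename" "$HDFS_DIR"
done
```

Because the comparison is character by character, fewInfo.tsv.gz (s) sorts before fewInfo.txt (x), so it would be copied first.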

Note: The technical posts on this site are licensed under CC BY-SA 4.0; if you republish, please credit this site or the original source.
