
How can I copy files in subdirectories to a single directory in HDFS

I have an external table in Impala that is partitioned by two columns, so the HDFS directory has two levels of directories before you get to the actual data files. The table has become corrupt in the metastore and cannot be queried. I want to copy only the individual (~10k) files into a single directory so I can drop the corrupt table, remove the existing directories, and then run the data back into the table with the LOAD DATA INTO table query in Impala. The problem is I cannot find a way to copy just the files so that they all end up in a single directory, since LOAD DATA doesn't support subdirectory loading.

The structure looks like:

  • myroot
    • mysub1a
      • mysub2a
        • file1.txt
        • file2.txt

There are hundreds of directories at the mysub1 and mysub2 levels.

I have been able to get the correct list of just the files with:

hadoop fs -lsr /myroot/ | grep .parq

but I cannot figure out how to pass the output of this list into

hadoop fs -cp {mylist} /mynewdir/
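
One way to feed that list into the copy, assuming the file path is the last whitespace-separated field of each hadoop fs -lsr output line, is to loop over it with xargs (a sketch, not the accepted approach; /mynewdir is the target directory from above):

# Copy each listed .parq file into /mynewdir/ (one hadoop invocation per file, so slow for ~10k files)
hadoop fs -lsr /myroot/ | grep '\.parq$' | awk '{print $NF}' | xargs -I{} hadoop fs -cp {} /mynewdir/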

Wildcards should do the trick:

hdfs dfs -cp /myroot/*/*/*.parq /mynewdir

Note that if you don't need the files at their original locations, then an hdfs dfs -mv will be much faster.
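
A minimal end-to-end sketch of the move variant, assuming the destination directory /mynewdir and a placeholder table name mytable (the corrupt table would be dropped and recreated first):

# Create the flat target directory and move every data file into it
hdfs dfs -mkdir -p /mynewdir
hdfs dfs -mv /myroot/*/*/*.parq /mynewdir/
# Load the files back into the recreated table; LOAD DATA INPATH moves them into the table's directory
impala-shell -q "LOAD DATA INPATH '/mynewdir' INTO TABLE mytable"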
