
How can I copy files in subdirectories to a single directory in HDFS

I have an external table in Impala that is partitioned by two columns, so the HDFS directory has two levels of directories before you get to the actual data files. The table has become corrupt in the metastore and cannot be queried. I want to copy only the individual (~10k) files into a single directory so I can drop the corrupt table, remove the existing directories, and then run the data back into the table with the LOAD DATA INTO table query in Impala. The problem is I cannot find a way to copy just the files so that they all end up in a single directory, since LOAD DATA doesn't support subdirectory loading.

The structure looks like:

  • myroot
    • mysub1a
      • mysub2a
        • file1.txt
        • file2.txt

There are hundreds of directories at the mysub1 and mysub2 levels.

I have been able to get the correct list of just the files with:

hadoop fs -lsr /myroot/ | grep .parq

but I cannot figure out how to pass the output of this list into

hadoop fs -cp {mylist} /mynewdir/
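
One way to feed that list into the copy, assuming the file path is the last whitespace-separated field of each hadoop fs -lsr output line, is to loop over it with xargs (a sketch, not the accepted approach; /mynewdir is the target directory from above):

# Copy each listed .parq file into /mynewdir/ (one hadoop invocation per file, so slow for ~10k files)
hadoop fs -lsr /myroot/ | grep '\.parq$' | awk '{print $NF}' | xargs -I{} hadoop fs -cp {} /mynewdir/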

Wildcards should do the trick:

hdfs dfs -cp /myroot/*/*/*.parq /mynewdir

Note that if you don't need the files at their original locations, then an hdfs dfs -mv will be much faster.
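
A minimal end-to-end sketch of the move variant, assuming the destination directory /mynewdir and a placeholder table name mytable (the corrupt table would be dropped and recreated first):

# Create the flat target directory and move every data file into it
hdfs dfs -mkdir -p /mynewdir
hdfs dfs -mv /myroot/*/*/*.parq /mynewdir/
# Load the files back into the recreated table; LOAD DATA INPATH moves them into the table's directory
impala-shell -q "LOAD DATA INPATH '/mynewdir' INTO TABLE mytable"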
