How can I copy files in subdirectories to a single directory in HDFS
I have an external table in Impala that is partitioned by two columns, so the HDFS directory has two levels of directories before you get to the actual data files. The table has become corrupt in the metastore and cannot be queried. I want to copy just the individual files (~10k) into a single directory so I can drop the corrupt table, remove the existing directories, and then load the data back into the table with the LOAD DATA INTO TABLE query in Impala. The problem is that I cannot find a way to copy just the files so they all end up in a single directory, since LOAD DATA doesn't support loading from subdirectories.
The structure looks like /myroot/mysub1/mysub2/myfile.parq, and there are hundreds of directories at the mysub1 and mysub2 levels.
I have been able to get the correct list of just the files with:

hadoop fs -lsr /myroot/ | grep .parq

but I cannot figure out how to pass the output of this list into:

hadoop fs -cp {mylist} /mynewdir/
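One way to feed that listing into per-file copies is xargs. A sketch, assuming the layout above and that the path is the last whitespace-separated field of each listing line (`-lsr` is deprecated in newer Hadoop releases in favour of `-ls -R`):

```shell
# Keep only .parq lines and pull out the path (last field) from the
# recursive listing, then copy each file into the flat target directory.
list_parq_paths() {
  grep '\.parq$' | awk '{print $NF}'
}

hadoop fs -ls -R /myroot/ \
  | list_parq_paths \
  | xargs -I{} hadoop fs -cp {} /mynewdir/
```

Note that `xargs -I{}` runs one `hadoop fs -cp` invocation (and so one JVM start-up) per file, which is slow for ~10k files.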
Wildcards should do the trick:

hdfs dfs -cp /myroot/*/*/*.parq /mynewdir

Note that if you don't need the files at their original locations, then an hdfs dfs -mv will be much faster.
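The mv variant uses the same three-level glob (HDFS expands the pattern itself, with semantics similar to the shell's). As a sanity check of the pattern alone, here is a local sketch with invented partition names, showing that it matches only files exactly two levels down:

```shell
# Throwaway local tree mirroring /myroot/<level1>/<level2>/*.parq
# (the partition names below are hypothetical, for illustration only).
demo=$(mktemp -d)
mkdir -p "$demo/myroot/year=2015/month=01" "$demo/mynewdir"
touch "$demo/myroot/year=2015/month=01/part-0.parq"
touch "$demo/myroot/stray.parq"          # too shallow: the glob skips it

mv "$demo"/myroot/*/*/*.parq "$demo/mynewdir/"
ls "$demo/mynewdir"                      # part-0.parq
```

On the cluster the equivalent command would be hdfs dfs -mv /myroot/*/*/*.parq /mynewdir, which renames rather than copies the ~10k files.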