Is there a way to list the files in Hadoop HDFS and store only the file names locally, not the files themselves?

Question

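Is there a way to list the files in Hadoop HDFS and store only their names locally?

Example:

I have a file india_20210517_20210523.csv. I am currently using the copyToLocal command to copy the file from HDFS to local, but this is very time-consuming because the file is large. I only need the file's name stored in a file a.txt, so that I can run a cut operation on it in a bash script.

Please help.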

Answer 1

The simplest way is to use the following command.

hdfs dfs -ls /path/fileNames | awk '{print $8}' | xargs -n 1 basename > Output.txt

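How it works:

hdfs dfs -ls : lists the entries under the given path, one per line, with permissions, replication, owner, group, size, modification date and time, and the full path

awk '{print $8}' : prints the 8th column of that listing, which is the full path

xargs -n 1 basename : runs basename on each path, leaving only the file name without its directory

> Output.txt : writes the file names to a local text file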
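The pipeline above also passes directory entries through, and it splits names that contain spaces. As a minimal sketch of a stricter variant, the hypothetical helper extract_file_names below keeps only lines whose permission string starts with "-" (regular files) and rebuilds the path from column 8 onward. The sample listing is made up for illustration and assumes the usual 8-column layout of hdfs dfs -ls; it is not output from a real cluster.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads `hdfs dfs -ls`-style lines on stdin and prints
# the base name of every regular file (permission string starting with "-").
extract_file_names() {
  awk '$1 ~ /^-/ {
         # rebuild the path from field 8 onward so names containing spaces survive
         path = $8
         for (i = 9; i <= NF; i++) path = path " " $i
         n = split(path, parts, "/")
         print parts[n]
       }'
}

# Made-up sample listing in the assumed 8-column layout.
sample_listing='Found 3 items
drwxr-xr-x   - hdfs supergroup          0 2021-05-24 10:00 /data/archive
-rw-r--r--   3 hdfs supergroup 1073741824 2021-05-24 10:05 /data/india_20210517_20210523.csv
-rw-r--r--   3 hdfs supergroup       2048 2021-05-24 10:06 /data/notes.txt'

# Prints:
#   india_20210517_20210523.csv
#   notes.txt
printf '%s\n' "$sample_listing" | extract_file_names

# Against a real cluster the call would look like this (not run here):
#   hdfs dfs -ls /path/fileNames | extract_file_names > a.txt
```

The "Found 3 items" header and the directory line are dropped because neither starts with "-". Runs of several spaces inside a name are collapsed to one, since awk splits fields on whitespace.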

Hope this answers your question.

Answer 2

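If you want to do this programmatically, you can use Hadoop's FileSystem and FileStatus objects to:

list the contents of your target directory (the current one or any other),
check whether each entry in that directory is a file or another directory, and
write the name of each file as a new line in a locally stored file.

The code for such an application looks like this: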

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.PrintWriter;


public class Dir_ls
{
    public static void main(String[] args) throws Exception
    {
        if (args.length < 1)
        {
            System.err.println("Usage: Dir_ls <hdfs-directory>");
            return;
        }

        // get input directory as a command-line argument
        Path inputDir = new Path(args[0]);

        // loads the Hadoop configuration files found on the classpath
        Configuration conf = new Configuration();

        FileSystem fs = FileSystem.get(conf);

        if (!fs.exists(inputDir))
        {
            System.out.println("Directory named \"" + args[0] + "\" doesn't exist.");
            return;
        }

        // list directory's contents
        FileStatus[] fileList = fs.listStatus(inputDir);

        // create the local file and its writer; try-with-resources closes it even on error
        try (PrintWriter pw = new PrintWriter("output.txt"))
        {
            // scan each record of the contents of the input directory
            for (FileStatus file : fileList)
            {
                if (!file.isDirectory()) // only take files into account
                {
                    String name = file.getPath().getName();
                    System.out.println(name);
                    pw.write(name + "\n");
                }
            }
        }
    }
}

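So, if we run it with . as the argument against an HDFS directory that contains both subdirectories and text files, the application prints only the names of the files to the command line and writes the same names, one per line, to the locally stored output.txt.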

Solution 1
1 2021-06-03 10:16:33

Solution 2
0 2021-06-01 14:37:35
