
How to enumerate files in HDFS directory

How do I enumerate the files in an HDFS directory? This is for enumerating files in an Apache Spark cluster using Scala. I see there is the sc.textFile() option, but that will read the contents as well. I want to read only the file names.

I actually tried listStatus, but it didn't work; I get the error below. I am using Azure HDInsight Spark, and the blob store folder "testContainer@testhdi.blob.core.windows.net/example/" contains .json files.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val status = fs.listStatus(new Path("wasb://testContainer@testhdi.blob.core.windows.net/example/"))
status.foreach(x => println(x.getPath))

=========
Error:
========
java.io.FileNotFoundException: File wasb://testContainer@testhdi.blob.core.windows.net/example does not exist.
    at org.apache.hadoop.fs.azure.NativeAzureFileSystem.listStatus(NativeAzureFileSystem.java:2076)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:23)
    at $iwC$$iwC$$iwC.<init>(<console>:28)
    at $iwC$$iwC.<init>(<console>:30)
    at $iwC.<init>(<console>:32)
    at <init>(<console>:34)
    at .<init>(<console>:38)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
    at com.cloudera.livy.repl.scalaRepl.SparkInterpreter$$anonfun$executeLine$1.apply(SparkInterpreter.scala:272)
    at com.cloudera.livy.repl.scalaRepl.SparkInterpreter$$anonfun$executeLine$1.apply(SparkInterpreter.scala:272)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at scala.Console$.withOut(Console.scala:126)
    at com.cloudera.livy.repl.scalaRepl.SparkInterpreter.executeLine(SparkInterpreter.scala:271)
    at com.cloudera.livy.repl.scalaRepl.SparkInterpreter.executeLines(SparkInterpreter.scala:246)
    at com.cloudera.livy.repl.scalaRepl.SparkInterpreter.execute(SparkInterpreter.scala:104)
    at com.cloudera.livy.repl.Session.com$cloudera$livy$repl$Session$$executeCode(Session.scala:98)
    at com.cloudera.livy.repl.Session$$anonfun$3.apply(Session.scala:73)
    at com.cloudera.livy.repl.Session$$anonfun$3.apply(Session.scala:73)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Thanks!

The reason this is failing is that it is actually looking in your default storage container rather than in testContainer, and thus not finding the example folder. You can see this by changing the path to wasb://testContainer@testhdi.blob.core.windows.net/ and observing that it lists files from a different container.

I don't know why this is, but I discovered you can fix it by passing the path to the FileSystem.get call, like this:

val fs = FileSystem.get(new java.net.URI("wasb://testContainer@testhdi.blob.core.windows.net/example/"), new Configuration())
val status = fs.listStatus(new Path("wasb://testContainer@testhdi.blob.core.windows.net/example/"))
status.foreach(x => println(x.getPath))
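
Since the question asks for file names only, the FileStatus array returned above can be reduced to bare names; a minimal sketch building on the status value from the snippet above (Path.getName returns just the last path component):

status.map(_.getPath.getName).foreach(println)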

See the FileSystem class:

abstract FileStatus[] listStatus(Path f)

List the statuses of the files/directories in the given path if the path is a directory.

val fs = FileSystem.get(new Configuration())
val status = fs.listStatus(new Path(HDFS_PATH))
status.foreach(x => println(x.getPath))
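
Note that listStatus returns only the immediate children of the given path. If subdirectories need to be walked as well, the FileSystem class also provides listFiles(path, recursive), which returns a RemoteIterator; a minimal sketch, reusing the HDFS_PATH placeholder from above:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
// true means recurse into subdirectories; the iterator yields LocatedFileStatus entries
val files = fs.listFiles(new Path(HDFS_PATH), true)
while (files.hasNext) {
  println(files.next().getPath)
}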

Note: the HDFS API can be used from any JVM language, such as Java or Scala; below is a Java example as well.

import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.commons.io.FileUtils;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Method listFileStats. Logs the status of every entry directly under the given path.
 *
 * @param destination the directory to list
 * @param fs          the FileSystem to query
 * @throws FileNotFoundException
 * @throws IOException
 */
public static void listFileStats(final String destination, final FileSystem fs)
        throws FileNotFoundException, IOException {
    final FileStatus[] statuses = fs.listStatus(new Path(destination));
    for (final FileStatus status : statuses) {
        LOG.info("--  status {}    ", status.toString());
        // Human-readable size, e.g. "12 KB"
        LOG.info("Human readable size {} of file ", FileUtils.byteCountToDisplaySize(status.getLen()));
    }
}
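
A minimal usage sketch from a Spark shell, assuming the method above lives in a class named HdfsUtils (the class name and the /example path are hypothetical):

val fs = FileSystem.get(new Configuration())
// HdfsUtils is a hypothetical class holding the listFileStats method shown above
HdfsUtils.listFileStats("/example", fs)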
