How do I list all files available in a Spark cluster that are stored on Hadoop HDFS, using Scala or Python?

Question

What is the most efficient way to list the names of all files locally available in Spark? I am using the Scala API, but Python would be fine too.

Answer 1

import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.{ListBuffer, Stack}

// `sc` is the SparkContext (predefined in spark-shell).
val fs = FileSystem.get(sc.hadoopConfiguration)
val dirs = Stack[String]()             // directories still to visit
val files = ListBuffer.empty[String]   // file paths found so far
dirs.push("/user/username/")

// Depth-first walk: pop a directory, list it, push its subdirectories, record its files.
while (dirs.nonEmpty) {
  val status = fs.listStatus(new Path(dirs.pop()))
  status.foreach { x =>
    if (x.isDirectory) dirs.push(x.getPath.toString)
    else files += x.getPath.toString
  }
}

files.foreach(println)
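
As a possible alternative to the manual stack above, Hadoop's `FileSystem.listFiles(path, recursive)` can do the directory walk itself and return only files. The sketch below is not part of the original answer: the helper name `listFilesRecursively` is made up for illustration, and the `/user/username/` path is reused from the code above.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ListBuffer

// Hypothetical helper (an assumption, not from the original answer):
// collects the path of every file under `root`, letting Hadoop recurse.
def listFilesRecursively(fs: FileSystem, root: Path): List[String] = {
  val files = ListBuffer.empty[String]
  val it = fs.listFiles(root, true) // true = descend into subdirectories
  while (it.hasNext) {
    files += it.next().getPath.toString
  }
  files.toList
}

// Usage in spark-shell, where `sc` is already defined:
// val fs = FileSystem.get(sc.hadoopConfiguration)
// listFilesRecursively(fs, new Path("/user/username/")).foreach(println)
```

`listFiles` returns a `RemoteIterator`, so results are fetched as the loop advances rather than all at once, which may help on very large directory trees.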

Answer 1 posted 2017-05-17 18:37:19
