Scala: recursively walk all directories under a parent Hadoop directory

Question

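I have been trying to write a recursive function that walks every directory under a parent Hadoop path. The function below does walk the Hadoop path, but its output is a bunch of nested arrays of objects, which is not what I am looking for. My goal is a return type of Array[Path]. Any suggestions are much appreciated.

The aim is to get the bottom-level partition paths for a given parent directory. Example: the parent /hadoop/parent/path is partitioned by month and day, so in this case we expect an array of 365 paths.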

import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
val parentPath = "/hadoop/parent/path"
val hdfsPath: Path = new Path(parentPath)

def recursiveWalk(hdfsPath: Path): Array[Object] = {
    val fs: FileSystem = hdfsPath.getFileSystem(spark.sessionState.newHadoopConf())
    val fileIterable = fs.listStatus(hdfsPath)
    val res = for (f <- fileIterable) yield {
        if (f.isDirectory) {
            recursiveWalk(f.getPath).distinct
        }
        else {
            hdfsPath
        }
    }
    res.distinct
}

Answer 1

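The for loop in your recursive function yields one of two things for each item:

an array of objects (the output of the recursive call) if the item is a directory;
a Path if the item is a plain file.

That explains why you get nested arrays (arrays of arrays).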

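You can use flatMap to avoid the problem. It turns (or "flattens") a list of lists of objects into a single list of objects. In addition, to get the type you expect, the stop condition and the recursive case must return the same type (Array[Path]), so you need to wrap hdfsPath in an array.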
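To see the difference between the two shapes without a cluster, here is a minimal sketch in plain Scala. The children map and the names nestedWalk and flatWalk are made up for illustration; a Map of strings stands in for the Hadoop directory tree, and any name that is not a key is treated as a file.

```scala
object FlattenDemo {
  // Hypothetical stand-in for a directory tree: directory -> children.
  // A name that is not a key of this map is treated as a plain file.
  val children: Map[String, Array[String]] = Map(
    "/root"   -> Array("/root/a", "/root/b"),
    "/root/a" -> Array("/root/a/f1", "/root/a/f2"),
    "/root/b" -> Array("/root/b/f3")
  )

  // map with mismatched branch types: one branch gives an array, the other
  // a String, so the element type widens and the result is nested.
  def nestedWalk(dir: String): Array[Any] = {
    val step: String => Any =
      c => if (children.contains(c)) nestedWalk(c) else dir
    children(dir).map(step)
  }

  // flatMap with both branches returning Array[String]: the result stays flat.
  def flatWalk(dir: String): Array[String] =
    children(dir).flatMap { c =>
      if (children.contains(c)) flatWalk(c) else Array(dir)
    }.distinct
}
```

Here nestedWalk("/root") gives an array whose elements are themselves arrays, while flatWalk("/root") gives a flat array holding the two directories that contain files.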

Here is a quick fix based on the two points above:

def recursiveWalk(hdfsPath: Path): Array[Path] = {
    val fs: FileSystem = hdfsPath.getFileSystem(spark.sessionState.newHadoopConf())
    val fileIterable = fs.listStatus(hdfsPath)
    val res = fileIterable.flatMap(f => {
        if (f.isDirectory) {
            recursiveWalk(f.getPath).distinct
        }
        else {
            Array(hdfsPath)
        }
    })
    res.distinct
}

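The code above fixes the problem. To avoid distinct, you can put the condition on the input path itself rather than on its children, as shown below. You can also create the FileSystem once, outside the function. Note that this version returns every directory under the path, including the path itself.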

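// Create the FileSystem once, outside the recursive function.
val conf = new org.apache.hadoop.conf.Configuration()
val hdfs = org.apache.hadoop.fs.FileSystem.get(conf)

def recursiveWalk(path: Path): Array[Path] = {
    if (hdfs.isDirectory(path))
        // Recurse into every child, then append the directory itself.
        hdfs.listStatus(path).map(_.getPath).flatMap(p => recursiveWalk(p)) :+ path
    else Array.empty[Path]  // plain files contribute nothing
}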
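Since the question asks for the bottom-level partition paths only, a variant that keeps just the directories without subdirectories may be closer to the goal. The following is a sketch, not part of the original answer: the object LeafDirs and the function leafDirs are hypothetical names, and it assumes the path passed in is a directory.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

object LeafDirs {
  // Returns only the directories that contain no subdirectories,
  // for example the day-level partitions in a month/day layout.
  // Assumption: `path` is itself a directory.
  def leafDirs(fs: FileSystem, path: Path): Array[Path] = {
    val subDirs = fs.listStatus(path).filter(_.isDirectory).map(_.getPath)
    if (subDirs.isEmpty) Array(path)                 // no child directories: a leaf
    else subDirs.flatMap(d => leafDirs(fs, d))       // otherwise descend
  }
}

// Usage, assuming the default Hadoop configuration points at the right cluster:
// val fs = FileSystem.get(new org.apache.hadoop.conf.Configuration())
// val partitions = LeafDirs.leafDirs(fs, new Path("/hadoop/parent/path"))
```

Passing the FileSystem in as a parameter is a design choice made here so the function does not depend on a global value; the answer's approach of one shared hdfs value works just as well.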

Answer 2

Try this:

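def recursiveWalk(hdfsPath: Path): Array[Path] = {
    val fs: FileSystem = hdfsPath.getFileSystem(spark.sessionState.newHadoopConf())
    if (fs.isDirectory(hdfsPath)) {
        // Keep the directory itself; without it both branches only ever
        // produce empty arrays and the function always returns an empty result.
        fs.listStatus(hdfsPath).flatMap(innerPath => recursiveWalk(innerPath.getPath)) :+ hdfsPath
    } else Array.empty[Path]
}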

Or, if you need the files inside the directories, you can use:

def getDirsWithFiles(hdfsPath: Path): Array[Path] = {
    val fs: FileSystem = hdfsPath.getFileSystem(spark.sessionState.newHadoopConf())
    if (fs.isDirectory(hdfsPath)) {
        fs.listStatus(hdfsPath).flatMap(innerPath => getDirsWithFiles(innerPath.getPath))
    } else Array(hdfsPath)  // a plain file: return its own path
}
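As an alternative to hand-written recursion, Hadoop's FileSystem also has a built-in recursive listing, listFiles(path, true), which returns a RemoteIterator over files only. The sketch below uses it instead of the approach in the answers; the object ListFilesDemo and the functions allFiles and dirsWithFiles are hypothetical names introduced for illustration.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ArrayBuffer

object ListFilesDemo {
  // Collects the path of every file below `root`.
  def allFiles(fs: FileSystem, root: Path): Array[Path] = {
    val it = fs.listFiles(root, true)  // true = recurse into subdirectories
    val out = ArrayBuffer.empty[Path]
    while (it.hasNext()) out += it.next().getPath
    out.toArray
  }

  // Directories that directly contain at least one file.
  def dirsWithFiles(fs: FileSystem, root: Path): Array[Path] =
    allFiles(fs, root).map(_.getParent).distinct
}
```

Because listFiles reports files only, empty directories never appear in either result, so this fits best when every partition directory is known to hold at least one file.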

Answer 1: score 2, accepted, 2020-03-09 11:25:59
Answer 2: score 1, 2020-03-09 11:37:25
