How to efficiently join a Spark DataFrame with a directory of small files?

Question

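I have a DataFrame (df1) with the columns id, date, type. I need to join it with the files /descriptions/{id}.txt. The result of the join should be a DataFrame with the columns id, date, type, description, containing every entry in df1.

The goal is to use this preprocessed DataFrame for further analysis, so that I never have to deal with the small files again.

It is worth noting that there are far more small description files than I need (about 1000 times more), so I think some kind of "lazy join" would be more efficient than reading all the small files up front and then joining.

How would you build this in Spark? I currently use Scala, but if you have a Python example I think I could work with that too.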

Answer 1

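Option 1, read all the files:

You can read the descriptions with wholeTextFiles. It returns an RDD that maps each file path to its content, which you can then join with your DataFrame.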

val descriptions = sc
    .wholeTextFiles(".../descriptions")
    // We need to extract the id from the file path
    .map{ case (path, desc) => {
        val fileName = path.split("/").last
        val id = "[0-9]+".r.findFirstIn(fileName)
        id.get.toLong -> desc
    }}
    .toDF("id", "description")

val result = df1.join(descriptions, Seq("id"))
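
The id extraction used above calls id.get, which throws if a file name contains no digits. A minimal sketch of the same extraction that returns an Option instead is shown below. The object and method names (IdFromPath, idFromPath) and the example paths are hypothetical and only serve as an illustration.

```scala
// Hypothetical helper, not part of the Spark API: extracts the numeric id
// from a path such as ".../descriptions/42.txt".
object IdFromPath {
  private val Digits = "[0-9]+".r

  // Returns None when the file name contains no digits.
  def idFromPath(path: String): Option[Long] = {
    val fileName = path.split("/").last
    Digits.findFirstIn(fileName).map(_.toLong)
  }

  def main(args: Array[String]): Unit = {
    println(idFromPath("hdfs://host/descriptions/42.txt"))   // Some(42)
    println(idFromPath("hdfs://host/descriptions/_SUCCESS")) // None
  }
}
```

With such a helper, the map step could become a flatMap that keeps only the files for which an id was found, at the cost of silently skipping unexpected file names.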

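Option 2, read only the files you need:

For this you can use binaryFiles. It creates an RDD that maps each file path to a PortableDataStream, so the files are not read immediately. You can then select all the distinct ids from df1, join them with the RDD, and read the content of only the files you need. The code looks like this: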

val idRDD = df1
    .select("id").distinct
    .rdd.map(_.getAs[Long]("id") -> true)

val descriptions = sc.binaryFiles(".../descriptions")
    // same as before, but the description is not read yet
    .map{ case (path, descFile) => {
        val fileName = path.split("/").last
        val id = "[0-9]+".r.findFirstIn(fileName)
        id.get.toLong -> descFile
    }}
    // inner join with the ids we are interested in
    .join(idRDD)
    .map{ case (id, (file, _)) => id -> file }
    // reading the files
    .mapValues(file => {
         val reader = scala.io.Source.fromInputStream(file.open)
         val desc = reader.getLines.mkString("\n")
         reader.close
         desc
    })
    .toDF("id", "description")

val result = df1.join(descriptions, Seq("id"))
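
In the "reading the files" step, the reader is not closed if reading throws. One possible way to tighten this, assuming Scala 2.13 or later (where scala.util.Using is available), is sketched below. ReadDescription and readAll are hypothetical names, and UTF-8 is an assumed encoding for the description files.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.nio.charset.StandardCharsets
import scala.io.{Codec, Source}
import scala.util.Using

object ReadDescription {
  // Hypothetical helper: reads the whole stream as UTF-8 text (an assumption)
  // and closes it even if reading fails.
  def readAll(in: InputStream): String =
    Using.resource(Source.fromInputStream(in)(Codec.UTF8)) { source =>
      source.getLines().mkString("\n")
    }

  def main(args: Array[String]): Unit = {
    // An in-memory stream stands in for a description file here.
    val bytes = "first line\nsecond line\n".getBytes(StandardCharsets.UTF_8)
    println(readAll(new ByteArrayInputStream(bytes)))
  }
}
```

In the Spark job, the body of mapValues could then be reduced to something like file => ReadDescription.readAll(file.open()).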

Answer 1
-1 2020-02-07 09:22:16
