How can I efficiently join a dataframe in Spark with a directory of small files?
I have a dataframe (df1) with the columns id, date, type. I need to join it with a file /descriptions/{id}.txt. The result of the join should be a dataframe id, date, type, description for all entries in df1.
The goal is to use this preprocessed dataframe for further analysis so I don't have to deal with the small files anymore.
Worth noting: there are far more small description files than I need (about 1000x), so I think some kind of "lazy join" would be more efficient than reading all of the small files upfront and joining them.
How would you build this in Spark? I currently use Scala, but if you have a Python example I think I could work with that as well.
Option 1, reading all the files:
You could read the descriptions with wholeTextFiles. It returns an RDD mapping each file path to its content; you can then join the result with your dataframe.
val descriptions = sc
  .wholeTextFiles(".../descriptions")
  // extract the numeric id from the file path, e.g. "123.txt" -> 123
  .map { case (path, desc) =>
    val fileName = path.split("/").last
    val id = "[0-9]+".r.findFirstIn(fileName)
    id.get.toLong -> desc
  }
  .toDF("id", "description") // requires import spark.implicits._

val result = df1.join(descriptions, Seq("id"))
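Since the question mentions Python as an alternative, here is a rough PySpark sketch of the same approach. It is a sketch under assumptions: a running SparkSession named spark, df1 already loaded, and the same placeholder path as above; the helper names (id_from_path, build_descriptions_df) are illustrative, not from any library.

```python
import re

def id_from_path(path):
    # Extract the numeric id from a file path like ".../descriptions/123.txt"
    file_name = path.split("/")[-1]
    return int(re.search(r"[0-9]+", file_name).group(0))

def build_descriptions_df(spark, desc_dir):
    # PySpark version of option 1: read every file, key it by id
    # (assumes `spark` is an active SparkSession)
    rdd = (
        spark.sparkContext
        .wholeTextFiles(desc_dir)
        .map(lambda pair: (id_from_path(pair[0]), pair[1]))
    )
    return spark.createDataFrame(rdd, ["id", "description"])
```

Usage would then be result = df1.join(build_descriptions_df(spark, ".../descriptions"), ["id"]).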
Option 2, reading only the files you need
For that, you could use binaryFiles. It creates an RDD that maps each file path to a PortableDataStream, so the files are not read right away. Then you can select all the distinct id values from df1, join them with the RDD, and read the content of only the files you need. The code would look like this:
val idRDD = df1
  .select("id").distinct
  .rdd.map(_.getAs[Long]("id") -> true)

val descriptions = sc.binaryFiles(".../descriptions")
  // same as before, but the description is not read yet
  .map { case (path, descFile) =>
    val fileName = path.split("/").last
    val id = "[0-9]+".r.findFirstIn(fileName)
    id.get.toLong -> descFile
  }
  // inner join with the ids we are interested in
  .join(idRDD)
  .map { case (id, (file, _)) => id -> file }
  // only now is each remaining file actually read
  .mapValues { file =>
    val reader = scala.io.Source.fromInputStream(file.open)
    val desc = reader.getLines.mkString("\n")
    reader.close()
    desc
  }
  .toDF("id", "description") // requires import spark.implicits._

val result = df1.join(descriptions, Seq("id"))