How can I efficiently join a dataframe in Spark with a directory of small files?
I have a dataframe (df1) with the columns id, date, type. I need to join it with a file /descriptions/{id}.txt. The result of the join should be a dataframe id, date, type, description for all entries in df1.
The goal is to use this preprocessed dataframe for further analysis so I don't have to deal with the small files anymore.
Worth noting: there are far more small description files than I need (about 1000x), so I think some kind of "lazy join" would be more efficient than reading all of the small files upfront and joining them.
How would you build this in Spark? I currently use Scala, but if you have a Python example I think I could work with that as well.
Option 1, reading all the files:
You could read the descriptions with wholeTextFiles. It returns an RDD mapping each file path to its content; you can then join the result with your dataframe.
val descriptions = sc
  .wholeTextFiles(".../descriptions")
  // extract the numeric id from the file path, e.g. "123.txt" -> 123
  .map { case (path, desc) =>
    val fileName = path.split("/").last
    val id = "[0-9]+".r.findFirstIn(fileName)
    id.get.toLong -> desc
  }
  .toDF("id", "description") // requires import spark.implicits._

val result = df1.join(descriptions, Seq("id"))
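Since the question mentions Python as an alternative, here is a rough PySpark sketch of the same approach. It is a sketch under assumptions: a running SparkSession named spark, df1 already loaded, and the same placeholder path as above; the helper names (id_from_path, build_descriptions_df) are illustrative, not from any library.

```python
import re

def id_from_path(path):
    # Extract the numeric id from a file path like ".../descriptions/123.txt"
    file_name = path.split("/")[-1]
    return int(re.search(r"[0-9]+", file_name).group(0))

def build_descriptions_df(spark, desc_dir):
    # PySpark version of option 1: read every file, key it by id
    # (assumes `spark` is an active SparkSession)
    rdd = (
        spark.sparkContext
        .wholeTextFiles(desc_dir)
        .map(lambda pair: (id_from_path(pair[0]), pair[1]))
    )
    return spark.createDataFrame(rdd, ["id", "description"])
```

Usage would then be result = df1.join(build_descriptions_df(spark, ".../descriptions"), ["id"]).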
Option 2, reading only the files you need
For that, you could use binaryFiles. It creates an RDD that maps each file path to a PortableDataStream, so the files are not read right away. Then you can select all the distinct id values from df1, join them with the RDD, and read the content of only the files you need. The code would look like this:
val idRDD = df1
  .select("id").distinct
  .rdd.map(_.getAs[Long]("id") -> true)

val descriptions = sc.binaryFiles(".../descriptions")
  // same as before, but the description is not read yet
  .map { case (path, descFile) =>
    val fileName = path.split("/").last
    val id = "[0-9]+".r.findFirstIn(fileName)
    id.get.toLong -> descFile
  }
  // inner join with the ids we are interested in
  .join(idRDD)
  .map { case (id, (file, _)) => id -> file }
  // only now is each remaining file actually read
  .mapValues { file =>
    val reader = scala.io.Source.fromInputStream(file.open)
    val desc = reader.getLines.mkString("\n")
    reader.close()
    desc
  }
  .toDF("id", "description") // requires import spark.implicits._

val result = df1.join(descriptions, Seq("id"))