I have a dataframe (df1) with the columns id, date, type. I need to join it with a file /descriptions/{id}.txt. The result of the join should be a dataframe with the columns id, date, type, description for all entries in df1.
The goal is to use this preprocessed dataframe for further analysis so I don't have to deal with the small files anymore.
Worth noting: there are far more small description files than I actually need (about 1000x), so I think some kind of "lazy join" would be more efficient than reading all of the small files upfront and joining them.
How would you build this in Spark? I currently use Scala, but if you have a Python example I think I could work with that as well.
Option 1, reading all the files:
You could read the descriptions with wholeTextFiles. It returns an RDD mapping each file path to its content, and you can then join the result with your dataframe.
import spark.implicits._ // needed for toDF outside the spark-shell

val descriptions = sc
  .wholeTextFiles(".../descriptions")
  // extract the id from the file path, e.g. .../descriptions/123.txt -> 123
  .map { case (path, desc) =>
    val fileName = path.split("/").last
    val id = "[0-9]+".r.findFirstIn(fileName)
    id.get.toLong -> desc
  }
  .toDF("id", "description")
val result = df1.join(descriptions, Seq("id"))
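Since you mentioned Python is fine too: the id-extraction step ports directly. Here is a minimal sketch of the path-to-id parsing you would use inside a map over wholeTextFiles (the helper name extract_id is mine, not part of any API):

```python
import re

def extract_id(path):
    """Extract the numeric id from a description file path
    such as /descriptions/123.txt."""
    file_name = path.split("/")[-1]
    match = re.search(r"[0-9]+", file_name)
    if match is None:
        raise ValueError(f"no id found in {path}")
    return int(match.group())

# extract_id("/descriptions/123.txt") returns 123
```

In PySpark this would look roughly like `sc.wholeTextFiles(...).map(lambda pd: (extract_id(pd[0]), pd[1])).toDF(["id", "description"])`, followed by the same join on id.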
Option 2, reading only the files you need:
For that, you could use binaryFiles. It creates an RDD that maps each file path to a PortableDataStream, so the files are not read right away. You can then select the distinct id values from df1, join them with the RDD, and only read the content of the files you need. The code would look like this:
val idRDD = df1
  .select("id").distinct
  .rdd.map(_.getAs[Long]("id") -> true)

val descriptions = sc.binaryFiles(".../descriptions")
  // same extraction as before, but the file content is not read yet
  .map { case (path, descFile) =>
    val fileName = path.split("/").last
    val id = "[0-9]+".r.findFirstIn(fileName)
    id.get.toLong -> descFile
  }
  // inner join with the ids we are interested in
  .join(idRDD)
  .map { case (id, (file, _)) => id -> file }
  // only now do we actually read the files
  .mapValues { file =>
    val reader = scala.io.Source.fromInputStream(file.open)
    val desc = reader.getLines.mkString("\n")
    reader.close()
    desc
  }
  .toDF("id", "description")
val result = df1.join(descriptions, Seq("id"))
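To see why the lazy variant helps, here is the same idea in plain Python without Spark: collect the ids you need first, then open only the matching files. The directory layout and the helper name read_needed_descriptions are assumptions for illustration, not any library API:

```python
import os
import re

def read_needed_descriptions(desc_dir, needed_ids):
    """Read only the description files whose numeric id
    appears in needed_ids; all other files stay unopened."""
    needed = set(needed_ids)
    result = {}
    for file_name in os.listdir(desc_dir):
        match = re.search(r"[0-9]+", file_name)
        if match is None:
            continue
        file_id = int(match.group())
        if file_id not in needed:
            continue  # skipped files are never opened, never read
        with open(os.path.join(desc_dir, file_name)) as f:
            result[file_id] = f.read()
    return result
```

With 1000x more files than ids, this only pays the I/O cost for the files you actually join, which is exactly what the binaryFiles + join approach achieves in a distributed way.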