
How can I efficiently join a dataframe in Spark with a directory of small files?

I have a dataframe (df1) with the columns id, date, type. For each id, I need to join it with the file /descriptions/{id}.txt. The result of the join should be a dataframe with the columns id, date, type, description for all entries in df1.

The goal is to use this preprocessed dataframe for further analysis, so I don't have to deal with the small files anymore.

Worth noting: there are far more small description files than I actually need (roughly 1000x), so I think some kind of "lazy join" would be more efficient than reading all of the small files upfront and then joining.

How would you build this in Spark? I currently use Scala, but if you have a Python example I think I could work with that as well.
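For reference, here is roughly what the setup looks like (the sample values and app name are made up; the real df1 comes from storage):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("join-descriptions").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._

// hypothetical sample of df1, just to make the schema concrete
val df1 = Seq(
  (1L, "2020-01-01", "A"),
  (2L, "2020-01-02", "B")
).toDF("id", "date", "type")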

Option 1, reading all the files:

You could read the descriptions with wholeTextFiles. It returns an RDD that maps each file path to its content; you can then extract the id from the path and join the result with your dataframe.

import spark.implicits._ // needed for .toDF on an RDD

val descriptions = sc
    .wholeTextFiles(".../descriptions")
    // extract the id from the file path, e.g. ".../descriptions/42.txt" -> 42
    .map { case (path, desc) =>
        val fileName = path.split("/").last
        val id = "[0-9]+".r.findFirstIn(fileName)
        id.get.toLong -> desc // assumes every file name contains the numeric id
    }
    .toDF("id", "description")

val result = df1.join(descriptions, Seq("id"))
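If you prefer to stay in the DataFrame API, here is a minimal sketch of the same idea, assuming Spark 2.2+ (which added the wholetext option for the text source); the regex and path are assumptions about your file layout:

import org.apache.spark.sql.functions.{input_file_name, regexp_extract}

val descriptionsDF = spark.read
    .option("wholetext", "true") // one row per file instead of one row per line
    .text(".../descriptions")
    // derive the id from the file name, e.g. ".../descriptions/42.txt" -> 42
    .withColumn("id", regexp_extract(input_file_name(), "([0-9]+)\\.txt$", 1).cast("long"))
    .withColumnRenamed("value", "description")

val resultDF = df1.join(descriptionsDF, Seq("id"))

This avoids dropping down to RDDs, but like Option 1 it still lists and reads every file.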

Option 2, reading only the files you need:

For that, you could use binaryFiles. It creates an RDD that maps each file path to a PortableDataStream, so the files are not read right away. You can then select the distinct ids from df1, join them with the RDD, and only read the content of the files you actually need. The code would look like this:

val idRDD = df1
    .select("id").distinct
    .rdd.map(_.getAs[Long]("id") -> true)

val descriptions = sc.binaryFiles(".../descriptions")
    // same as before, but the file content is not read yet
    .map { case (path, descFile) =>
        val fileName = path.split("/").last
        val id = "[0-9]+".r.findFirstIn(fileName)
        id.get.toLong -> descFile
    }
    // inner join with the ids we are interested in
    .join(idRDD)
    .map { case (id, (file, _)) => id -> file }
    // only now read the files that survived the join
    .mapValues { file =>
        val reader = scala.io.Source.fromInputStream(file.open())
        val desc = reader.getLines.mkString("\n")
        reader.close()
        desc
    }
    .toDF("id", "description")

val result = df1.join(descriptions, Seq("id"))
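Whichever option you use, since the goal is to reuse the preprocessed dataframe, it is worth persisting the join result once, e.g. as Parquet (the output path below is a placeholder), so later analyses never touch the small files again:

// write the preprocessed result once
result.write.mode("overwrite").parquet(".../preprocessed")

// later analyses read the columnar dataset instead of the small files
val preprocessed = spark.read.parquet(".../preprocessed")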
