How to create multiple DataFrames from multiple lists in Scala Spark

I'm trying to create multiple DataFrames from the two lists below:

val paths = ListBuffer("s3://abc_xyz_tableA.json",
                       "s3://def_xyz_tableA.json",
                       "s3://abc_xyz_tableB.json",
                       "s3://def_xyz_tableB.json",
                       "s3://abc_xyz_tableC.json",....)

val tableNames = ListBuffer("tableA","tableB","tableC","tableD",....)

I want to create a separate DataFrame for each table name by grouping together all the S3 paths that end with the same table name, since each table has its own schema.

So, for example, when the tables and their related paths are grouped together:

 "tableADF" will have all the data from the paths "s3://abc_xyz_tableA.json" and "s3://def_xyz_tableA.json", as they have "tableA" in the path

 "tableBDF" will have all the data from the paths "s3://abc_xyz_tableB.json" and "s3://def_xyz_tableB.json", as they have "tableB" in the path

and so on; there can be many table names and paths.

I've tried different approaches but haven't been successful yet. Any leads toward the desired solution would be a great help. Thanks!

Using the built-in input_file_name() function, you can filter on the file names to get a DataFrame for each file or file pattern:

import org.apache.spark.sql.functions._
import spark.implicits._

// Load all the JSON files and tag each row with its source file name
var df = spark.read.format("json").load("s3://data/*.json")
df = df.withColumn("input_file", input_file_name())

// Filter on the file-name suffix to get one DataFrame per table
val tableADF = df.filter($"input_file".endsWith("tableA.json"))
val tableBDF = df.filter($"input_file".endsWith("tableB.json"))
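
If the list of tables is long, you can drive the same filters from the tableNames list from the question instead of hard-coding one val per table. A minimal sketch, reusing df and tableNames from above (tableDFs is just an illustrative name):

// Build a Map of "<tableName>DF" -> DataFrame, driven by the tableNames list
val tableDFs = tableNames.map { name =>
  s"${name}DF" -> df.filter($"input_file".endsWith(s"${name}.json"))
}.toMap

tableDFs("tableADF").show() // e.g. inspect one table's data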

If the list of file postfix names is pretty long, then you can use something like the code below; also find the code explanation inline:

import org.apache.spark.sql.functions._

object DFByFileName {

  def main(args: Array[String]): Unit = {

    val spark = Constant.getSparkSess // helper that builds the SparkSession

    import spark.implicits._

    // Load your JSON data
    var df = spark.read.format("json").load("s3://data/*.json")

    // Add a column with the source file name
    df = df.withColumn("input_file", input_file_name())

    // Extract the unique file postfixes (table names) from the file names into a List;
    // distinct() keeps only the distinct file names before collecting to the driver
    val fileGroupList = df.select("input_file").distinct().map(row => {
      val fileName = row.getString(0)
      val index1 = fileName.lastIndexOf("_")
      val index2 = fileName.lastIndexOf(".")
      fileName.substring(index1 + 1, index2)
    }).collect().distinct

    // For each file group, build the DataFrame containing only that group's files
    fileGroupList.map(fileGroupName => {
      df.filter($"input_file".endsWith(s"${fileGroupName}.json"))
      //perform dataframe operations
    })
  }

}
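
One design note: df is filtered once per file group, so caching it first can avoid rescanning S3 on every filter. A small sketch of that variation using the standard cache() API (groupDFs is just an illustrative name):

// Cache the tagged DataFrame so the per-group filters reuse it
df.cache()

// Key each group's DataFrame by its postfix for easy lookup later
val groupDFs = fileGroupList.map { g =>
  g -> df.filter($"input_file".endsWith(s"${g}.json"))
}.toMap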


Check the code below. The final result type is:

scala.collection.immutable.Map[String,org.apache.spark.sql.DataFrame] = Map(tableBDF -> [...], tableADF -> [...], tableCDF -> [...]), where ... is your column list.

paths
  .map(path => (s"${path.split("_").last.split("\\.json").head}DF", path)) // parse each file name, extracting table name and path into a tuple
  .groupBy(_._1)                      // group paths that share the same table name
  .map(p => (p._1 -> p._2.map(_._2))) // combine the paths for each table into a list
  .par                                // .par so the subsequent steps execute in parallel
  .map(mp => {
    (
      mp._1,                       // table name
      mp._2.par                    // for the same DF, load multiple files in parallel
        .map(spark.read.json(_))   // read each file from S3
        .reduce(_ union _)         // union when the same table has multiple files
    )
  })
  .seq // back to a sequential immutable Map
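
As a usage sketch, you can bind the pipeline above to a val and then look up each table's DataFrame by its key (tableDFs is just an illustrative name):

// Bind the pipeline above to a val, then look up tables by key
val tableDFs = paths
  .map(path => (s"${path.split("_").last.split("\\.json").head}DF", path))
  .groupBy(_._1)
  .map(p => (p._1 -> p._2.map(_._2))).par
  .map(mp => (mp._1, mp._2.par.map(spark.read.json(_)).reduce(_ union _)))
  .seq

tableDFs("tableADF").printSchema() // schema of the unioned tableA data
tableDFs("tableADF").show(5)       // first rows across all tableA files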
