
Spark lookup from a small file

I am doing a Spark project and need advice on the best way to solve the problem below:

I have a DataFrame (say MainDF) which has millions of records. The format is (name: String, value: Int). Example content below:

Davi,130
Joel,20
Emma,500

I have another small file with 4 records, in the format (className: String, minValue: Int, maxValue: Int). I need to create a file by looking up the class name whose min/max range contains each value. The small file's content is below:

First,500,9999999
Second,100,499
Third,0,99
Unknown,-99999,0

I need to look up this small file for each value in MainDF and add the class name based on the value range from the small file. Example:

Davi,130,Second
Joel,20,Third
Emma,500,First

This is the code I have written:

// Main data read, millions of records
val MainData = sc.textFile("/mainfile.csv")
case class MainType(Name: String, value: Int)
val MainDF = MainData.map(line => line.split(",")).map(e => MainType(e(0), e(1).toInt)).toDF
MainDF.registerTempTable("MainTable")

// Ref data, just 4 records
val refData = sc.broadcast(sc.textFile("/refdata.csv"))
case class refDataType(className: String, minValue: Int, maxValue: Int)
val refRDD = refData.map(line => line.split(",")).map(e => refDataType(e(0), e(1).toInt, e(2).toInt))

I think I have to write a UDF here, but I don't know how to use a DataFrame inside a UDF. Is there any other way to do this in Spark Scala?

You can read the file as an RDD using textFile and collect it, since it's very small (and maybe broadcast it, depending on your requirements).

Once you have the Array from collecting the RDD, you can create a Range for each class and then a UDF to check whether your value falls in that range.
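For instance, the read-collect-broadcast step could look roughly like this (just a sketch: it reuses the /refdata.csv path and a case class like the one in your question; RefRecord and refBc are illustrative names):

// Sketch only: RefRecord and refBc are illustrative names
case class RefRecord(className: String, minValue: Int, maxValue: Int)

val refArr = sc.textFile("/refdata.csv")
  .map(_.split(","))
  .map(e => RefRecord(e(0), e(1).toInt, e(2).toInt))
  .collect()                      // only 4 rows, so collecting to the driver is safe

val refBc = sc.broadcast(refArr)  // optional: ship the small array to every executor once

In the rest of this answer I simply hard-code the same four rows with parallelize to keep the example self-contained: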

import org.apache.spark.sql.functions.udf

val rdd = sc.parallelize(Array(
  ("First",   500, 9999999),
  ("Second",  100, 499),
  ("Third",     0, 99),
  ("Unknown", -99999, 0)
))

// First element is the class name,
// second is the inclusive Range(min, max)
val dataArr = rdd.map { case (className, min, max) =>
  (className, Range.inclusive(min, max))
}.collect
// sc.broadcast(dataArr) here if needed

val getClassName = udf { (x: Int) =>
  dataArr
    .collectFirst { case (className, range) if range.contains(x) => className }
    .orNull
}

df.withColumn("ClassName", getClassName($"VALUE") ).show
+----+-----+---------+
|NAME|VALUE|ClassName|
+----+-----+---------+
|Davi|  130|   Second|
|Joel|   20|    Third|
|Emma|  500|    First|
+----+-----+---------+
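
If you do broadcast dataArr as the comment above suggests, the UDF would read from the broadcast handle instead. A sketch of that variant (bcData and getClassNameBc are just illustrative names):

// Broadcast variant; bcData / getClassNameBc are illustrative names
val bcData = sc.broadcast(dataArr)

val getClassNameBc = udf { (x: Int) =>
  bcData.value
    .collectFirst { case (className, range) if range.contains(x) => className }
    .orNull
}

df.withColumn("ClassName", getClassNameBc($"VALUE")).show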

I'm positive there are better solutions available.

The easiest way here is to read both files using the csv data source and join them with standard Spark SQL, like this:

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val mainSchema = StructType(Seq(
  StructField("name", StringType, false),
  StructField("value", IntegerType, false)))
val mainDf = spark.read.schema(mainSchema).csv("/tmp/b.txt")

val lookupSchema = StructType(Seq(
  StructField("class_name", StringType, false),
  StructField("min_value", IntegerType, false),
  StructField("max_value", IntegerType, false)))
val lookupDf = spark.read.schema(lookupSchema).csv("/tmp/a.txt")

// inclusive on both ends so a boundary value like 500 still matches First
val result = mainDf.join(lookupDf, $"value" >= $"min_value" && $"value" <= $"max_value")
result.show()
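
If you only need the three columns shown in the expected output, you can project them from the join result (assuming the column names defined in the schemas above):

// Keep only the columns from the expected output
result.select($"name", $"value", $"class_name").show()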

I am not sure whether the most performant way is this one or the one suggested by @philantrovert (this might also depend on the Spark version you are using). You should try both of them and decide for yourself.
