Spark look up from a small file

I am doing a Spark project and need advice on how to solve the problem below in the best way:

I have a DataFrame (say MainDF) which has millions of records. The format is (name: String, value: Int). Content example below:

Davi,130
Joel,20
Emma,500

I have another small file with 4 lines of records, like this: (className: String, minValue: Int, maxValue: Int). I need to create a file by looking up the class name based on the min/max range the value falls into. The content of the small file is below:

First,500,9999999
Second,100,499
Third,0,99
Unknown,-99999,0

I need to look up this small file for each value in MainDF and add the class name based on the value range from the small file. Example:

Davi,130,Second
Joel,20,Third
Emma,500,First

This is the code I have written:

// Main data read, millions of records
val MainData = sc.textFile("/mainfile.csv")
case class MainType(Name: String, value: Int)
val MainDF = MainData.map(line => line.split(",")).map(e => MainType(e(0), e(1).toInt)).toDF
MainDF.registerTempTable("MainTable")

// Ref data, just 4 records
val refData = sc.textFile("/refdata.csv")
case class refDataType(className: String, minValue: Int, maxValue: Int)
val refRDD = refData.map(line => line.split(",")).map(e => refDataType(e(0), e(1).toInt, e(2).toInt))

I think I have to write a UDF here, but I don't know how to use a DataFrame inside a UDF. Is there any other way to do this in Spark Scala?

You can read the file as an RDD using textFile and collect it, since it's very small (and maybe broadcast it, depending on your requirements).
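For instance, a minimal sketch of that first step, assuming the reference file path /refdata.csv from the question:

// Read the small reference file, parse it, and collect it to the driver
val refArr = sc.textFile("/refdata.csv")
  .map(_.split(","))
  .map(e => (e(0), e(1).toInt, e(2).toInt))
  .collect()

// Optionally broadcast the collected array so each executor gets a single copy
val refArrBc = sc.broadcast(refArr)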

Once you have the Array by collecting the RDD, you can create a Range for each row and then a UDF to check whether your value falls in that range.

val rdd = sc.parallelize(Array(
  ("First",   500, 9999999),
  ("Second",  100, 499),
  ("Third",   0,   99),
  ("Unknown", -99999, 0)
))

// First element is the class name,
// second is the inclusive Range(min, max)
val dataArr = rdd.map { case (className, min, max) =>
  (className, Range.inclusive(min, max))
}.collect
// sc.broadcast(dataArr) here if needed

val getClassName = udf { (x: Int) =>
  dataArr.map { e =>
    if (e._2.contains(x)) e._1.toString
    else null.asInstanceOf[String]
  }.filter(_ != null)
   .apply(0)
}

df.withColumn("ClassName", getClassName($"VALUE") ).show
+----+-----+---------+
|NAME|VALUE|ClassName|
+----+-----+---------+
|Davi|  130|   Second|
|Joel|   20|    Third|
|Emma|  500|    First|
+----+-----+---------+
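If a value falls outside every range, the filter above leaves an empty array and apply(0) throws an exception. A slightly safer variant of the same UDF (just a sketch, reusing the same dataArr) could use find instead:

val getClassNameSafe = udf { (x: Int) =>
  dataArr.find { case (_, range) => range.contains(x) }
    .map(_._1)
    .orNull   // null ClassName when no range matches
}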

I'm positive there might be better solutions available.

The easiest way here is to read both files using the csv datasource and join them using standard Spark SQL, like this:

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val mainSchema = StructType(Seq(
  StructField("name", StringType, false),
  StructField("value", IntegerType, false)))
val mainDf = spark.read.schema(mainSchema).csv("/tmp/b.txt")

val lookupSchema = StructType(Seq(
  StructField("class_name", StringType, false),
  StructField("min_value", IntegerType, false),
  StructField("max_value", IntegerType, false)))
val lookupDf = spark.read.schema(lookupSchema).csv("/tmp/a.txt")

val result = mainDf.join(lookupDf, $"value" <= $"max_value" && $"value" >= $"min_value")
result.show()

I am not sure whether this is the most performant way or the one suggested by @philantrovert (this might also depend on the Spark version you are using). You should try both and decide for yourself.
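Since the lookup table is only four rows, it might also be worth adding an explicit broadcast hint to the join so the large table isn't shuffled; a small sketch, reusing mainDf and lookupDf from above:

import org.apache.spark.sql.functions.broadcast

val resultWithHint = mainDf.join(broadcast(lookupDf),
  $"value" <= $"max_value" && $"value" >= $"min_value")
resultWithHint.show()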
