Applying Bucketizer to Spark dataframe after partitioning based on a column value
I need to apply a Spark Bucketizer to the dataframe df below. This is mockup data; the original dataframe has around 10k records.
instance name value percentage
A37 Histogram.ratio 1 0.20
A37 Histogram.ratio 20 0.34
A37 Histogram.ratio 50 0.04
A37 Histogram.ratio 500 0.13
A37 Histogram.ratio 2000 0.05
A37 Histogram.ratio 9000 0.32
A49 Histogram.ratio 1 0.50
A49 Histogram.ratio 20 0.24
A49 Histogram.ratio 25 0.09
A49 Histogram.ratio 55 0.12
A49 Histogram.ratio 120 0.06
A49 Histogram.ratio 300 0.08
I need to apply the bucketizer after partitioning the dataframe by the instance column. Each value of instance has a different split array, defined below:
val splits_map = Map("A37" -> Array(0,30,1000,5000,9000), "A49" -> Array(0,10,30,80,998))
I can perform bucketing on a single column using the code below, but I need help partitioning the dataframe by the instance column and then applying bucketizer.transform:
val bucketizer = new Bucketizer().setInputCol("value").setOutputCol("value_range").setSplits(splits)
val df2 = bucketizer.transform(df)
df2.groupBy("value_range").sum("percentage").show()
Is it possible to split the dataframe into multiple dataframes by the value of the instance column, bucketize the value column of each, and then use groupBy().sum() to compute the sum of percentage?
Expected output:
instance name bucket percentage
A37 Histogram.ratio 0 0.54
A37 Histogram.ratio 1 0.17
A37 Histogram.ratio 3 0.05
A37 Histogram.ratio 4 0.32
A49 Histogram.ratio 0 0.50
A49 Histogram.ratio 1 0.33
A49 Histogram.ratio 2 0.12
A49 Histogram.ratio 3 0.14
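One way to do this, sketched below under a few assumptions: Bucketizer.setSplits takes Array[Double], so the Int split arrays are converted first, and bucketizePerInstance and splitsAsDouble are hypothetical names introduced here, not part of any API. The idea is to filter the dataframe once per instance, apply that instance's Bucketizer, and union the partial results:
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
// Hypothetical helper: build one Bucketizer per instance value, apply it to
// that instance's rows only, and union the partial results back together.
def bucketizePerInstance(df: DataFrame, splitsMap: Map[String, Array[Double]]): DataFrame = {
  splitsMap.map { case (inst, splits) =>
    val bucketizer = new Bucketizer()
      .setInputCol("value")
      .setOutputCol("value_range")
      .setSplits(splits)
      .setHandleInvalid("keep") // put out-of-range values in an extra bucket instead of failing
    bucketizer.transform(df.filter(col("instance") === inst))
  }.reduce(_ union _)
}
// Bucketizer requires Double splits, so convert the Int arrays first.
val splitsAsDouble = splits_map.mapValues(_.map(_.toDouble)).toMap
val df2 = bucketizePerInstance(df, splitsAsDouble)
df2.groupBy("instance", "name", "value_range").sum("percentage").show()
This trades one filter-plus-transform pass per instance for simplicity; with many distinct instance values, the join-based approach below scales better.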
An alternative way to bucketize the data within each partition:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
def bucketizeWithinPartition(df: DataFrame, splits: Map[String, Array[Int]], partitionCol: String, featureCol: String): DataFrame = {
  val window = Window.partitionBy(partitionCol).orderBy($"bucket_start")
  // Explode each split array into one row per bucket boundary, then derive each
  // bucket's end from the next boundary within the same partition value.
  val splitsDf = splits.toList.toDF(partitionCol, "splits")
    .withColumn("bucket_start", explode($"splits"))
    .withColumn("bucket_end", coalesce(lead($"bucket_start", 1).over(window), lit(Int.MaxValue)))
    // row_number() is 1-based; subtract 1 so bucket indices start at 0, matching the expected output.
    .withColumn("bucket", row_number().over(window) - 1)
  // Range join: match each row to the bucket whose [bucket_start, bucket_end) interval contains its value.
  val joinCond = "d.%s = s.%s AND d.%s >= s.bucket_start AND d.%s < s.bucket_end".format(partitionCol, partitionCol, featureCol, featureCol)
  df.as("d")
    .join(splitsDf.as("s"), expr(joinCond), "inner")
    .select($"d.*", $"s.bucket")
}
val data = List(
  ("A37", "Histogram.ratio", 1, 0.20),
  ("A37", "Histogram.ratio", 20, 0.34),
  ("A37", "Histogram.ratio", 9000, 0.32),
  ("A49", "Histogram.ratio", 1, 0.50),
  ("A49", "Histogram.ratio", 20, 0.24)
).toDF("instance", "name", "value", "percentage")
val splits_map = Map("A37" -> Array(0,30,1000,5000,9000), "A49" -> Array(0,10,30,80,998))
val bucketedData = bucketizeWithinPartition(data, splits_map, "instance", "value")
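The per-bucket percentage sums from the expected output can then be computed directly; a minimal usage sketch:
bucketedData
  .groupBy("instance", "name", "bucket")
  .sum("percentage")
  .orderBy("instance", "bucket")
  .show()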