Applying Bucketizer to Spark dataframe after partitioning based on a column value
I need to apply the Spark Bucketizer to the DataFrame df below. This is mock data; the original DataFrame has about 10k records.
instance name value percentage
A37 Histogram.ratio 1 0.20
A37 Histogram.ratio 20 0.34
A37 Histogram.ratio 50 0.04
A37 Histogram.ratio 500 0.13
A37 Histogram.ratio 2000 0.05
A37 Histogram.ratio 9000 0.32
A49 Histogram.ratio 1 0.50
A49 Histogram.ratio 20 0.24
A49 Histogram.ratio 25 0.09
A49 Histogram.ratio 55 0.12
A49 Histogram.ratio 120 0.06
A49 Histogram.ratio 300 0.08
I need to apply the bucketizer after partitioning the DataFrame by the instance column. Each value of instance has its own splits array, defined as follows:
val splits_map = Map("A37" -> Array(0,30,1000,5000,9000), "A49" -> Array(0,10,30,80,998))
I use the code below to bucketize a single column, but I need help partitioning the DataFrame by the instance column and then applying bucketizer.transform:
val bucketizer = new Bucketizer().setInputCol("value").setOutputCol("value_range").setSplits(splits)
val df2 = bucketizer.transform(df)
df2.groupBy("value_range").sum("percentage").show()
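Is it possible to split the DataFrame into multiple DataFrames by the value of the instance column, bucketize the value column in each of them, and then use groupBy().sum() to compute the sum of the percentages?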
Expected output:
instance name bucket percentage
A37 Histogram.ratio 0 0.54
A37 Histogram.ratio 1 0.17
A37 Histogram.ratio 3 0.05
A37 Histogram.ratio 4 0.32
A49 Histogram.ratio 0 0.50
A49 Histogram.ratio 1 0.33
A49 Histogram.ratio 2 0.12
A49 Histogram.ratio 3 0.14
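One possible way to get a result of this shape while still using Bucketizer is sketched below. It rests on a few assumptions: a local SparkSession named `spark`, splits stored as `Array[Double]` (the type `Bucketizer.setSplits` takes), and `value` cast to double before bucketing. The names `splitsMap`, `perInstance`, and `result` are made up for this illustration. The idea is to filter the DataFrame once per instance, run a Bucketizer configured with that instance's splits, and union the pieces before aggregating.

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("bucketize-per-instance").getOrCreate()
import spark.implicits._

// Mock data from the question.
val df = Seq(
  ("A37", "Histogram.ratio", 1, 0.20),
  ("A37", "Histogram.ratio", 20, 0.34),
  ("A37", "Histogram.ratio", 50, 0.04),
  ("A37", "Histogram.ratio", 500, 0.13),
  ("A37", "Histogram.ratio", 2000, 0.05),
  ("A37", "Histogram.ratio", 9000, 0.32),
  ("A49", "Histogram.ratio", 1, 0.50),
  ("A49", "Histogram.ratio", 20, 0.24),
  ("A49", "Histogram.ratio", 25, 0.09),
  ("A49", "Histogram.ratio", 55, 0.12),
  ("A49", "Histogram.ratio", 120, 0.06),
  ("A49", "Histogram.ratio", 300, 0.08)
).toDF("instance", "name", "value", "percentage")

// Splits as Double, because Bucketizer.setSplits expects Array[Double].
val splitsMap: Map[String, Array[Double]] = Map(
  "A37" -> Array(0.0, 30.0, 1000.0, 5000.0, 9000.0),
  "A49" -> Array(0.0, 10.0, 30.0, 80.0, 998.0)
)

// One Bucketizer per instance, applied only to that instance's rows.
val perInstance: Seq[DataFrame] = splitsMap.toSeq.map { case (instance, splits) =>
  val bucketizer = new Bucketizer()
    .setInputCol("value")
    .setOutputCol("bucket")
    .setSplits(splits)
  val subset = df
    .filter(col("instance") === instance)
    .withColumn("value", col("value").cast("double"))
  bucketizer.transform(subset)
}

// Stitch the pieces back together and sum the percentages per bucket.
val result = perInstance.reduce(_ union _)
  .groupBy("instance", "name", "bucket")
  .agg(sum("percentage").as("percentage"))
  .orderBy("instance", "bucket")

result.show()
```

Bucketizer labels buckets as doubles starting at 0.0, and the last bucket includes its upper bound, so for A37 the value 9000 lands in bucket 3.0 and the labels may not match the table above exactly. Rows whose instance has no entry in `splitsMap` are dropped by this sketch, and a value outside an instance's splits would make `transform` fail under the default `handleInvalid` setting.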
Another way to bucket the data within each partition:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Joins each row to the [bucket_start, bucket_end) range taken from its own partition's splits.
def bucketizeWithinPartition(df: DataFrame, splits: Map[String, Array[Int]], partitionCol: String, featureCol: String): DataFrame = {
  val window = Window.partitionBy(partitionCol).orderBy($"bucket_start")
  // One row per split point; bucket_end is the next split, or Int.MaxValue after the last one.
  val splitsDf = splits.toList.toDF(partitionCol, "splits")
    .withColumn("bucket_start", explode($"splits"))
    .withColumn("bucket_end", coalesce(lead($"bucket_start", 1).over(window), lit(Int.MaxValue)))
    .withColumn("bucket", row_number().over(window)) // bucket numbers start at 1
  val joinCond = "d.%s = s.%s AND d.%s >= s.bucket_start AND d.%s < s.bucket_end".format(partitionCol, partitionCol, featureCol, featureCol)
  df.as("d")
    .join(splitsDf.as("s"), expr(joinCond), "inner")
    .select($"d.*", $"s.bucket")
}
val data =
  List(
    ("A37", "Histogram.ratio", 1, 0.20),
    ("A37", "Histogram.ratio", 20, 0.34),
    ("A37", "Histogram.ratio", 9000, 0.32),
    ("A49", "Histogram.ratio", 1, 0.50),
    ("A49", "Histogram.ratio", 20, 0.24)
  ).toDF("instance", "name", "value", "percentage")
val splits_map = Map("A37" -> Array(0,30,1000,5000,9000), "A49" -> Array(0,10,30,80,998))
val bucketedData = bucketizeWithinPartition(data, splits_map, "instance", "value")
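To get from the bucketed rows to the summed percentages the question asks for, a grouping step is still needed. The sketch below is self-contained for illustration: `bucketed` is a hand-built stand-in with the same columns as `bucketedData`, filled with the bucket numbers the range join above should assign to the five sample rows, and `summed` is a hypothetical name.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("sum-per-bucket").getOrCreate()
import spark.implicits._

// Stand-in for bucketedData: the sample rows plus the bucket the range join gives them.
val bucketed = Seq(
  ("A37", "Histogram.ratio", 1, 0.20, 1),
  ("A37", "Histogram.ratio", 20, 0.34, 1),
  ("A37", "Histogram.ratio", 9000, 0.32, 5),
  ("A49", "Histogram.ratio", 1, 0.50, 1),
  ("A49", "Histogram.ratio", 20, 0.24, 2)
).toDF("instance", "name", "value", "percentage", "bucket")

// Sum the percentages per instance and bucket.
val summed = bucketed
  .groupBy("instance", "name", "bucket")
  .agg(sum("percentage").as("percentage"))
  .orderBy("instance", "bucket")

summed.show()
```

Two differences from Bucketizer are worth keeping in mind. Bucket numbers here start at 1 because `row_number()` does, so subtracting 1 would give 0-based labels. The last split point also opens its own bucket running up to `Int.MaxValue`, which is why 9000 falls into bucket 5 for A37 instead of joining the bucket below it.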