
Spark DataFrame - How to partition the data based on a condition

I have an employee data set that I need to partition by employee salary, based on some conditions. I created a DataFrame, converted it to a custom DataFrame object, and wrote a custom partitioner for the salary:

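import org.apache.spark.Partitioner

class SalaryPartition(override val numPartitions: Int) extends Partitioner {

  // Spark expects getPartition to return an index in [0, numPartitions).
  override def getPartition(key: Any): Int = {
    import com.csc.emp.spark.tutorial.PartitonObj._
    // Map the matched value to a partition index by salary range.
    key.asInstanceOf[Emp].EMPLOYEE_ID match {
      case salary if salary < 10000 => 1
      case salary if salary >= 10000 && salary < 20000 => 2
      case _ => 3
    }
  }

}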

Question: how can I invoke my custom partitioner? I couldn't find partitionBy on DataFrame. Is there an alternative way?

Just the code for my comment:

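// Assumes a SparkSession named `spark` and a case class Emp whose second field is `salary: Int`.
import org.apache.spark.sql.functions.udf
import spark.implicits._

val empDS = List(Emp(5, 1000), Emp(4, 15000), Emp(3, 30000), Emp(2, 2000)).toDS()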
println(s"Original partitions number: ${empDS.rdd.partitions.size}")
println("-- Original partition: data --")
empDS.rdd.mapPartitionsWithIndex((index, it) => {
  it.foreach(r => println(s"Partition $index: $r")); it
}).count()

// Grade 1: below 10000; grade 2: 10000 up to (but excluding) 20000; grade 3: everything else.
val getSalaryGrade = (salary: Int) => salary match {
  case salary if salary < 10000 => 1
  case salary if salary >= 10000 && salary < 20000 => 2
  case _ => 3
}
val getSalaryGradeUDF = udf(getSalaryGrade)
val salaryGraded = empDS.withColumn("salaryGrade", getSalaryGradeUDF($"salary"))

val repartitioned = salaryGraded.repartition($"salaryGrade")
println()
println(s"Partitions number after: ${repartitioned.rdd.partitions.size}")
println("-- Reparitioned partition: data --")

repartitioned.as[Emp].rdd.mapPartitionsWithIndex((index, it) => {
  it.foreach(r => println(s"Partition $index: $r")); it
}).count()

The output is:

Original partitions number: 2
-- Original partition: data --
Partition 1: Emp(3,30000)
Partition 0: Emp(5,1000)
Partition 1: Emp(2,2000)
Partition 0: Emp(4,15000)

Partitions number after: 5
-- Reparitioned partition: data --
Partition 1: Emp(3,30000)
Partition 3: Emp(5,1000)
Partition 3: Emp(2,2000)
Partition 4: Emp(4,15000)

Note (a guess): repartitioning by a column hashes its values, so rows with the same "salaryGrade" always land in the same partition, but several different grades may end up sharing one partition.
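If exact control over which grade goes to which partition is needed, one option is to drop down to the RDD API: `partitionBy` is defined for key-value RDDs and accepts a custom `Partitioner`. The following is only a minimal sketch under stated assumptions, not the asker's actual setup: the `Emp(id, salary)` case class, the `SalaryGradePartitioner` name, the local SparkSession, and the half-open salary ranges are all hypothetical.

```scala
import org.apache.spark.Partitioner
import org.apache.spark.sql.SparkSession

// Hypothetical record type; the field names are assumptions.
case class Emp(id: Int, salary: Int)

// Hypothetical partitioner: keys are salaries, one partition per salary range.
class SalaryGradePartitioner extends Partitioner {
  override val numPartitions: Int = 3

  // Must return an index in [0, numPartitions).
  override def getPartition(key: Any): Int = key.asInstanceOf[Int] match {
    case s if s < 10000 => 0
    case s if s < 20000 => 1
    case _              => 2
  }
}

object CustomPartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("custom-partition-sketch")
      .getOrCreate()
    import spark.implicits._

    val empDS = List(Emp(5, 1000), Emp(4, 15000), Emp(3, 30000), Emp(2, 2000)).toDS()

    // partitionBy exists only on key-value RDDs, so key each row by its salary first.
    val partitioned = empDS.rdd
      .keyBy(_.salary)
      .partitionBy(new SalaryGradePartitioner)
      .values

    // glom() turns each partition into an array so the layout can be inspected.
    partitioned.glom().collect().zipWithIndex.foreach { case (rows, index) =>
      println(s"Partition $index: ${rows.mkString(", ")}")
    }

    spark.stop()
  }
}
```

The result is an RDD rather than a DataFrame; converting it back with `toDS()` keeps the rows but Spark no longer tracks the custom partitioner on the Dataset side.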

Advice: "groupBy" or something similar looks like a more reliable solution.
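As one possible illustration of that advice, the sketch below derives the grade with built-in column expressions instead of a UDF and then aggregates per grade. The `Emp(id, salary)` case class, the `GroupBySketch` and `withGrade` names, the local SparkSession, and the half-open salary ranges are assumptions made for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, count, when}

// Hypothetical record type; the field names are assumptions.
case class Emp(id: Int, salary: Int)

object GroupBySketch {
  // Adds a "salaryGrade" column using when/otherwise rather than a UDF.
  def withGrade(df: DataFrame): DataFrame =
    df.withColumn(
      "salaryGrade",
      when(col("salary") < 10000, 1)
        .when(col("salary") < 20000, 2)
        .otherwise(3)
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("group-by-sketch")
      .getOrCreate()
    import spark.implicits._

    val empDF = List(Emp(5, 1000), Emp(4, 15000), Emp(3, 30000), Emp(2, 2000)).toDF()

    // One output row per grade, with the employee count and the ids in that grade.
    withGrade(empDF)
      .groupBy(col("salaryGrade"))
      .agg(count("*").as("employees"), collect_list(col("id")).as("ids"))
      .orderBy(col("salaryGrade"))
      .show()

    spark.stop()
  }
}
```

Grouping this way does not depend on how many physical partitions Spark creates, which is why it tends to be more predictable than relying on partition indices.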

To stay with Dataset entities, "groupByKey" can be used:

empDS.groupByKey(x => getSalaryGrade(x.salary)).mapGroups((index, it) => {
  it.foreach(r => println(s"Group $index: $r")); index
}).count()

Output:

Group 1: Emp(5,1000)
Group 3: Emp(3,30000)
Group 1: Emp(2,2000)
Group 2: Emp(4,15000)
