Spark dropDuplicates source code

Question

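I am reading the Spark source code to see how the dropDuplicates method works. The method definition contains a call to Deduplicate, but I cannot find where it is defined or referenced. It would be great if someone could point me in the right direction. The link is here.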

Answer 1

It is in spark-catalyst; see here.

Since the implementation is a bit confusing, I'll add some explanation.

The current implementation of Deduplicate is:

/** A logical plan for `dropDuplicates`. */
case class Deduplicate(
    keys: Seq[Attribute],
    child: LogicalPlan) extends UnaryNode {

  override def output: Seq[Attribute] = child.output
}

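It is not obvious from this what actually happens, because the case class only declares the logical plan node. If you look at the Optimizer class, you will find the ReplaceDeduplicateWithAggregate object, and then it becomes clearer: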

/**
 * Replaces logical [[Deduplicate]] operator with an [[Aggregate]] operator.
 */
object ReplaceDeduplicateWithAggregate extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Deduplicate(keys, child) if !child.isStreaming =>
      val keyExprIds = keys.map(_.exprId)
      val aggCols = child.output.map { attr =>
        if (keyExprIds.contains(attr.exprId)) {
          attr
        } else {
          Alias(new First(attr).toAggregateExpression(), attr.name)(attr.exprId)
        }
      }
      // SPARK-22951: Physical aggregate operators distinguishes global aggregation and grouping
      // aggregations by checking the number of grouping keys. The key difference here is that a
      // global aggregation always returns at least one row even if there are no input rows. Here
      // we append a literal when the grouping key list is empty so that the result aggregate
      // operator is properly treated as a grouping aggregation.
      val nonemptyKeys = if (keys.isEmpty) Literal(1) :: Nil else keys
      Aggregate(nonemptyKeys, aggCols, child)
  }
}
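
The SPARK-22951 comment in the rule above can be illustrated with a small plain-Scala model. This is only a sketch of the semantics the comment describes, not Spark code: the object EmptyKeySketch and both function names are hypothetical, and a Seq[Int] stands in for a one-column input.

```scala
// Plain-Scala model of the distinction described in the SPARK-22951 comment.
// All names here are hypothetical; nothing in this block is Spark API.
object EmptyKeySketch {
  // Global aggregation (no grouping keys): always yields exactly one result
  // row, even when the input is empty. None plays the role of a null value.
  def globalFirst(values: Seq[Int]): Seq[Option[Int]] =
    Seq(values.headOption)

  // Grouping aggregation on a constant key (the role of Literal(1) in the
  // rule): yields one row per group, so an empty input yields no rows.
  def groupedFirst(values: Seq[Int]): Seq[Option[Int]] =
    values.groupBy(_ => 1).values.map(_.headOption).toVector
}
```

In this model, globalFirst on an empty input returns one row holding None, while groupedFirst returns no rows at all. The second behaviour is what dropDuplicates with no key columns needs on an empty input.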

Bottom line: for a df with columns col1, col2, col3, col4,

df.dropDuplicates("col1", "col2")

is more or less equivalent to

df.groupBy("col1", "col2").agg(first("col3"), first("col4"))
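
The equivalence can be sketched with plain Scala collections, without a Spark session. This is a minimal model under stated assumptions, not Spark's implementation: the Row case class, its field types, and dropDuplicatesByKey are hypothetical names chosen to mirror col1 through col4. The model always keeps the first row in input order, whereas in Spark the result of first depends on row order, which may be non-deterministic after a shuffle.

```scala
// Plain-Scala model of dropDuplicates("col1", "col2"); no Spark required.
// Row and dropDuplicatesByKey are hypothetical names used for illustration.
object DropDuplicatesSketch {
  final case class Row(col1: String, col2: Int, col3: String, col4: Double)

  // Keep one row per distinct (col1, col2) pair. The non-key columns col3 and
  // col4 come from the first row seen for that pair, mirroring first(...).
  def dropDuplicatesByKey(rows: Seq[Row]): Seq[Row] = {
    // LinkedHashMap keeps keys in insertion order, so output order is stable.
    val seen = scala.collection.mutable.LinkedHashMap.empty[(String, Int), Row]
    rows.foreach { r =>
      val key = (r.col1, r.col2)
      if (!seen.contains(key)) seen(key) = r
    }
    seen.values.toVector
  }
}
```

For example, given the rows ("a", 1, "x", 1.0), ("a", 1, "y", 2.0) and ("b", 2, "z", 3.0), the model keeps the first and third rows and drops the second, because it shares its (col1, col2) key with the first.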

Solution 1
4 2018-06-20 13:21:03
