
Getting the number of rows in a Spark dataframe without counting

I am applying many transformations to a Spark DataFrame (filter, groupBy, join). I want to know the number of rows in the DataFrame after each transformation.

I am currently counting the rows with count() after each transformation, but this triggers an action every time, which is not really optimized.

I was wondering if there is any way of knowing the number of rows without having to trigger an action other than the original job.
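A minimal sketch of the pattern described above, assuming DataFrames named myDataFrame and myOtherDataframe (the same names used in the answers below); each count() is a separate action, so Spark re-evaluates the lineage every time:

import org.apache.spark.sql.functions.{col, lit, max}

val filtered = myDataFrame.filter(col("x") === lit(3))
println(filtered.count())   // action #1: evaluates the filter

val grouped = filtered.groupBy(col("x")).agg(max("y"))
println(grouped.count())    // action #2: re-runs the filter, then the aggregation

val joined = grouped.join(myOtherDataframe, col("x") === col("y"))
println(joined.count())     // action #3: re-runs the whole lineage once more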

You could use an accumulator for each stage and increment it in a map after each stage. Then, after you run your final action, you would have a count for every stage.

import org.apache.spark.sql.functions.{col, lit, max}

// One accumulator per stage
val filterCounter = spark.sparkContext.longAccumulator("filter-counter")
val groupByCounter = spark.sparkContext.longAccumulator("group-counter")
val joinCounter = spark.sparkContext.longAccumulator("join-counter")

// Note: .map on a DataFrame needs an Encoder[Row]
// (e.g. RowEncoder(df.schema) from org.apache.spark.sql.catalyst.encoders)
myDataFrame
    .filter(col("x") === lit(3))
    .map(x => {
      filterCounter.add(1)   // count rows surviving the filter
      x
    })
    .groupBy(col("x"))
    .agg(max("y"))
    .map(x => {
      groupByCounter.add(1)  // count rows after the aggregation
      x
    })
    .join(myOtherDataframe, col("x") === col("y"))
    .map(x => {
      joinCounter.add(1)     // count rows after the join
      x
    })
    .count()                 // the single action that evaluates the whole plan

println(s"count for filter = ${filterCounter.value}")
println(s"count for group by = ${groupByCounter.value}")
println(s"count for join = ${joinCounter.value}")

Each operator has a couple of metrics of its own. These metrics are visible in the Spark UI's SQL tab.

If the SQL tab is not used, we can introspect the query execution object of the DataFrame after execution to access the metrics (which are internally accumulators).

Example: df.queryExecution.executedPlan.metrics will give the metrics of the top-most node in the DAG.
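A minimal sketch of that introspection, assuming a SparkSession named spark (the example DataFrame here is made up for illustration); the metrics are only populated after an action has been executed, and which metrics exist depends on the operator at the top of the plan:

import org.apache.spark.sql.functions.col

val df = spark.range(100).filter(col("id") > 50).toDF()
df.collect()  // run an action so the metrics get populated

// Metrics (name -> SQLMetric) of the top-most node of the executed plan
df.queryExecution.executedPlan.metrics.foreach { case (name, metric) =>
  println(s"$name = ${metric.value}")
}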

Coming back to this question after a bit more experience with Apache Spark, to complement randal's answer.

You can also use a UDF to increment a counter.

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, lit, max, udf}
import org.apache.spark.util.LongAccumulator

val filterCounter = spark.sparkContext.longAccumulator("filter-counter")
val groupByCounter = spark.sparkContext.longAccumulator("group-counter")
val joinCounter = spark.sparkContext.longAccumulator("join-counter")

// Identity UDF that bumps the given accumulator for every row it sees
def countUdf(acc: LongAccumulator): UserDefinedFunction = udf { (x: Int) =>
  acc.add(1)
  x
}

myDataFrame
  .filter(col("x") === lit(3))
  .withColumn("x", countUdf(filterCounter)(col("x")))
  .groupBy(col("x"))
  .agg(max("y"))
  .withColumn("x", countUdf(groupByCounter)(col("x")))
  .join(myOtherDataframe, col("x") === col("y"))
  .withColumn("x", countUdf(joinCounter)(col("x")))
  .count()  // single action; the counters are filled as the rows flow through

println(s"count for filter = ${filterCounter.value}")
println(s"count for group by = ${groupByCounter.value}")
println(s"count for join = ${joinCounter.value}")

This should be more efficient because Spark only has to deserialize the column used in the UDF, but it has to be used carefully, since Catalyst can more easily reorder the operations (for example, pushing a filter before the call to the UDF).
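A quick way to see what Catalyst actually did is to inspect the plan. This is only a sketch reusing countUdf, filterCounter and myDataFrame from above; the extra filter on col("y") is a hypothetical later filter that Catalyst may push below the counting UDF, in which case the counter only sees the rows that survive it:

val counted = myDataFrame
  .withColumn("x", countUdf(filterCounter)(col("x")))  // intended: count every row of myDataFrame
  .filter(col("y") > lit(0))                           // hypothetical later filter
  .select(col("x"), col("y"))

counted.explain(true)  // in the optimized logical plan, check whether the Filter
                       // ended up below the Project that holds the UDF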
