
What is the similar alternative to reduceByKey in DataFrames

Given the following code:

case class Contact(name: String, phone: String)
case class Person(name: String, ts:Long, contacts: Seq[Contact])

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
val people = sqlContext.read.format("orc").load("people")

What is the best way to dedupe users by timestamp, so that only the record with the maximum `ts` is kept for each user? In Spark, using the RDD API, I would run something like this:

rdd.reduceByKey(_ maxTS _) 

and would add a `maxTS` method to `Person`, or add it via implicits:

def maxTS(that: Person): Person =
  if (that.ts > ts) that else this
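As a sanity check on the reduce semantics, here is a minimal plain-Scala sketch (no Spark required) that mimics `reduceByKey(_ maxTS _)` on a local collection; the sample records are invented for illustration:

```scala
case class Contact(name: String, phone: String)
case class Person(name: String, ts: Long, contacts: Seq[Contact]) {
  // keep whichever of the two records has the larger timestamp
  def maxTS(that: Person): Person = if (that.ts > ts) that else this
}

val people = Seq(
  Person("alice", 1L, Seq(Contact("alice", "111"))),
  Person("alice", 5L, Seq(Contact("alice", "222"))),
  Person("bob",   2L, Nil)
)

// rdd.map(p => (p.name, p)).reduceByKey(_ maxTS _) keeps, per name,
// the single Person with the maximum ts:
val deduped = people
  .groupBy(_.name)
  .values
  .map(_.reduce(_ maxTS _))
  .toList
```

Note that the winning record keeps its full `contacts` field, which is the dedupe behavior the question asks for.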

Is it possible to do the same with DataFrames, and will the performance be similar? We are using Spark 1.6.

You can use window functions. I'm assuming that the key is `name`:

import org.apache.spark.sql.functions.rowNumber
import org.apache.spark.sql.expressions.Window

val df = // convert to DataFrame
val win = Window.partitionBy('name).orderBy('ts.desc)

df.withColumn("personRank", rowNumber.over(win))
  .where('personRank === 1).drop("personRank")

For each person it creates a `personRank` column: every row with a given name gets a unique number, and the row with the latest `ts` gets the lowest rank, equal to 1. Then you drop the temporary rank column.
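The window logic can be sanity-checked in plain Scala (no Spark needed): within each name partition, sort descending by `ts` and keep the first row, i.e. rank 1. The sample data here is invented for illustration:

```scala
case class Row(name: String, ts: Long)

val rows = Seq(Row("alice", 1L), Row("alice", 5L), Row("bob", 2L))

// equivalent of rowNumber over Window.partitionBy('name).orderBy('ts.desc),
// followed by filtering personRank === 1:
val latest = rows
  .groupBy(_.name)            // partitionBy('name)
  .values
  .map(_.sortBy(-_.ts).head)  // orderBy('ts.desc), keep rank 1
  .toList
```

Unlike a plain `groupBy`/`agg`, this keeps the entire row, not just the grouping key and the aggregate.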

You can do a `groupBy` and use your preferred aggregation method, e.g. `sum`, `max`, etc.:

df.groupBy($"name").agg(max($"ts").alias("maxTS"))
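One caveat, sketched below in plain Scala with invented sample data: `groupBy(...).agg(...)` returns only the grouping key and the aggregated column, so other fields such as `contacts` are lost. If you need the full row, use the window-function approach instead.

```scala
case class Person(name: String, ts: Long, contacts: Seq[String])

val people = Seq(
  Person("alice", 1L, Seq("111")),
  Person("alice", 5L, Seq("222")),
  Person("bob",   2L, Nil)
)

// equivalent of df.groupBy($"name").agg(max($"ts").alias("maxTS")):
// the result holds (name, maxTS) pairs, not full Person rows
val agg: Map[String, Long] =
  people.groupBy(_.name).map { case (n, ps) => n -> ps.map(_.ts).max }
```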

Couldn't you just do what you have been doing by calling `DataFrame.rdd`?
