
Deduping events using hiveContext in Spark with Scala

I am trying to dedupe event records using the hiveContext in Spark with Scala. Converting the DataFrame to an RDD gives a compilation error saying "object Tuple23 is not a member of package scala". There is a known issue that a Scala tuple can't have 23 or more elements. Is there any other way to dedupe?

val events = hiveContext.table("default.my_table")
val valid_events = events.select(
                              events("key1"),events("key2"),events("col3"),events("col4"),events("col5"),
                              events("col6"),events("col7"),events("col8"),events("col9"),events("col10"),
                              events("col11"),events("col12"),events("col13"),events("col14"),events("col15"),
                              events("col16"),events("col17"),events("col18"),events("col19"),events("col20"),
                              events("col21"),events("col22"),events("col23"),events("col24"),events("col25"),
                              events("col26"),events("col27"),events("col28"),events("col29"),events("epoch")
                              )
//events are deduped based on latest epoch time
val valid_events_rdd = valid_events.rdd.map(t => {
                                                  ((t(0),t(1)),(t(2),t(3),t(4),t(5),t(6),t(7),t(8),t(9),t(10),t(11),t(12),t(13),t(14),t(15),t(16),t(17),t(18),t(19),t(20),t(21),t(22),t(23),t(24),t(25),t(26),t(27),t(28),t(29)))
                                              })

// reduce by key so we will only get one record for every primary key
val reducedRDD = valid_events_rdd.reduceByKey((a,b) => if ((a._28).compareTo(b._28) > 0) a else b)
//Get all the fields
reducedRDD.map(r => r._1 + "," + r._2._1 + "," + r._2._2).collect().foreach(println)

Off the top of my head:

  • use case classes, which no longer have a size limit (see the first sketch below). Just keep in mind that case classes won't work correctly in the Spark REPL,
  • use Row objects directly and extract only the keys (see the second sketch below),
  • operate directly on a DataFrame,

     import org.apache.spark.sql.functions.{col, max}

     val maxs = df
       .groupBy(col("key1"), col("key2"))
       .agg(max(col("epoch")).alias("epoch"))
       .as("maxs")

     df.as("df")
       .join(maxs,
             col("df.key1") === col("maxs.key1") &&
             col("df.key2") === col("maxs.key2") &&
             col("df.epoch") === col("maxs.epoch"))
       .drop(maxs("epoch"))
       .drop(maxs("key1"))
       .drop(maxs("key2"))

    or with a window function:

     import org.apache.spark.sql.expressions.Window
     import org.apache.spark.sql.functions.rowNumber

     // order by epoch descending so that row number 1 is the latest event per key
     val w = Window.partitionBy($"key1", $"key2").orderBy($"epoch".desc)

     df.withColumn("rn", rowNumber.over(w)).where($"rn" === 1).drop("rn")
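
    For the first option, a minimal sketch of the case class route (column names and types here are assumptions; the real class would list every selected column of default.my_table, and since Scala 2.11 a case class definition may have more than 22 fields):

     // Abbreviated case class for illustration; the real one would declare
     // every selected column. The String/Long types are assumptions.
     case class Event(key1: String, key2: String, col3: String, epoch: Long)

     val eventsRdd = valid_events.rdd.map { r =>
       ((r.getAs[String]("key1"), r.getAs[String]("key2")),
        Event(r.getAs[String]("key1"), r.getAs[String]("key2"),
              r.getAs[String]("col3"), r.getAs[Long]("epoch")))
     }

     // keep the record with the latest epoch for every (key1, key2) pair
     val dedupedEvents = eventsRdd.reduceByKey((a, b) => if (a.epoch > b.epoch) a else b).values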
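
    And a sketch of the second option, which keeps the whole Row as the value so no wide tuple is ever built; it assumes epoch is the last selected column (index 29) and is stored as a Long:

     // pair each row with just its two key columns; the value stays a Row
     val keyedRows = valid_events.rdd.map(r => ((r.get(0), r.get(1)), r))

     // keep the Row with the latest epoch for every (key1, key2) pair
     val dedupedRows = keyedRows
       .reduceByKey((a, b) => if (a.getLong(29) > b.getLong(29)) a else b)
       .values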
