How to get minimum value for each distinct key using ReduceByKey() in Scala

How to replace distinct() with reduceByKey
I have a scenario where the code below takes more than 10 hours in total for over 2 billion records. Even after trying a cluster of 35 i3 instances, performance is still poor. I am looking for a way to replace distinct() with reduceByKey(), and for any suggestions to improve performance...
```scala
val df = spark.read.parquet(out)

val df1 = df.select($"ID", $"col2", $"suffix", $"date", $"year", $"codes")

val df2 = df1
  .repartition(
    List(col("ID"), col("col2"), col("suffix"), col("date"),
         col("year"), col("codes")): _*)
  .distinct()

val df3 = df2.withColumn("codes", expr("transform(codes, (c,s) -> (d,s) )"))
df3.createOrReplaceTempView("df3")

val df4 = spark.sql(
  """SELECT
       ID, col2, suffix,
       d.s AS seq,
       d.c AS code,
       year, date
     FROM df3
     LATERAL VIEW explode(codes) exploded_table AS d
  """)

df4
  .repartition(600, List(col("year"), col("date")): _*)
  .write
  .mode("overwrite")
  .partitionBy("year", "date")
  .save(OutDir)
```
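One thing worth noting before swapping APIs: on DataFrames, distinct() already compiles to a hash aggregate with map-side partial aggregation, which is essentially what reduceByKey does at the RDD level, so the rewrite alone may not buy much. The explicit repartition on all six columns before distinct() forces an extra full shuffle that the aggregate would plan by itself. A minimal sketch of the deduplication step without that manual shuffle (the local SparkSession and sample rows are assumptions for illustration; the question's df1 would be used instead):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session, standing in for the question's `spark`
val spark = SparkSession.builder().master("local[*]").appName("dedup-sketch").getOrCreate()
import spark.implicits._

// Toy stand-in for df1: two identical rows, same column names as the question
val df1 = Seq(
  ("a", "x", "s1", "2020-01-01", "2020", "c1"),
  ("a", "x", "s1", "2020-01-01", "2020", "c1") // duplicate row
).toDF("ID", "col2", "suffix", "date", "year", "codes")

// dropDuplicates on all columns is equivalent to distinct() here, and Spark
// plans the shuffle itself; no manual repartition is needed beforehand.
val dedupCols = Seq("ID", "col2", "suffix", "date", "year", "codes")
val df2 = df1.dropDuplicates(dedupCols)
```

At this scale, tuning spark.sql.shuffle.partitions (default 200) to match the cluster is usually more effective than dropping down to the RDD API.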
I think distinct() is already implemented with reduceByKey (a reduce), but if you want to implement it yourself, you can do something like:
```scala
val array = List((1, 2), (1, 3), (1, 5), (1, 2), (2, 2), (2, 2), (3, 2), (3, 2), (4, 1), (1, 3))
val pairRDD = session.sparkContext.parallelize(array)
// Use each whole pair as the key, collapse duplicates, then drop the dummy value
val distinctResult = pairRDD.map(x => (x, null)).reduceByKey((x, _) => x).keys
```
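The title also asks for the minimum value per distinct key, which with reduceByKey is just a min combiner: partial aggregation runs map-side, so only one (key, min) pair per key leaves each partition. A self-contained sketch (the local SparkSession here is an assumption for illustration; the answer's `session` plays the same role):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session for the sketch
val session = SparkSession.builder().master("local[*]").appName("min-per-key").getOrCreate()

val array = List((1, 2), (1, 3), (1, 5), (1, 2), (2, 2), (3, 2), (4, 1))
val pairRDD = session.sparkContext.parallelize(array)

// Minimum value for each distinct key; the combiner runs map-side first
val minPerKey = pairRDD.reduceByKey((a, b) => math.min(a, b))
```

Here `minPerKey.collect()` yields (1,2), (2,2), (3,2), (4,1) in some order: key 1 keeps the smallest of 2, 3, 5, and 2.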