
Group By two different keys in two different DataFrames using Spark Scala without join

I am new to Spark and Scala, and I want to compute a similarity variable from two DataFrames (or RDDs). They have no common key; I tried a cartesian join, but the joined DataFrame is huge. Is it possible to compute the new variable from both DataFrames without joining them?

eg:

df1.show
+----+------------+------+
| id1|        food| level|
+----+------------+------+
|id11|       pasta| first|
|id11|       pizza|second|
|id11|   ice cream| first|
|id12|     spanish| first|
|id12|   ice cream|second|
|id13|      fruits| first|
+----+------------+------+
df2.show
+----+---------+
| id2|     food|
+----+---------+
|id21|    pizza|
|id21|   fruits|
|id22|    pasta|
|id22|    pizza|
|id22|ice cream|
+----+---------+

For each id1 in df1, I want to compare its food values against the food values of each id2 in df2 (grouped by id2) and count how many they have in common.
I want to get this output:

+----+----+----------------+
| id1| id2|count_similarity|
+----+----+----------------+
|id11|id21|               1| <- id11 and id21 have only "pizza" in common
|id11|id22|               3|
|id12|id21|               0|
|id12|id22|               1|
|id13|id21|               1|
|id13|id22|               0|
+----+----+----------------+

Is it possible to compute this with a map operation on RDDs? Thank you.

You can convert both DataFrames to RDDs, group the food items by id, use the cartesian method to compute the similarity for each id pair, and then rebuild a DataFrame:

import spark.implicits._   // for .toDF on the result RDD

case class Similarity(id1: String, id2: String, count_similarity: Int)

// Collect each id's foods into a list: RDD[(id, List[food])]
val rdd1 = df1.rdd.groupBy(_.getString(0)).mapValues(_.map(_.getString(1)).toList)
val rdd2 = df2.rdd.groupBy(_.getString(0)).mapValues(_.map(_.getString(1)).toList)

// One pair per (id1, id2); the similarity is the size of the food intersection
rdd1.cartesian(rdd2).map {
    case (x, y) => Similarity(x._1, y._1, x._2.intersect(y._2).size)
}.toDF.orderBy("id1").show
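Note that the cartesian product here runs over one record per id (the groupBy output), so it contains only (distinct id1 count) × (distinct id2 count) pairs, far fewer than a row-level cross join of the raw tables. The result: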

+----+----+----------------+
| id1| id2|count_similarity|
+----+----+----------------+
|id11|id22|               3|
|id11|id21|               1|
|id12|id21|               0|
|id12|id22|               1|
|id13|id21|               1|
|id13|id22|               0|
+----+----+----------------+
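For comparison, a minimal DataFrame-only sketch of the same idea, assuming Spark 2.4+ (for array_intersect) and the df1/df2 shown in the question:

import org.apache.spark.sql.functions.{collect_set, array_intersect, size}
import spark.implicits._   // for the $ column syntax

// Aggregate each id's foods into a set first, so the cross join
// runs over ids rather than raw rows
val foods1 = df1.groupBy("id1").agg(collect_set("food").as("foods1"))
val foods2 = df2.groupBy("id2").agg(collect_set("food").as("foods2"))

foods1.crossJoin(foods2)
  .select($"id1", $"id2",
    size(array_intersect($"foods1", $"foods2")).as("count_similarity"))
  .orderBy("id1", "id2")
  .show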

Would this work for you?

df1.createOrReplaceTempView("temp_table_1")
df2.createOrReplaceTempView("temp_table_2")

spark.sql(
  """SELECT id1, id2, count(*) AS count_similarity FROM temp_table_1 AS t1
   | JOIN temp_table_2 AS t2 ON (t1.food = t2.food)
   | GROUP BY id1, id2
   | ORDER BY id1, id2""".stripMargin
).show
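One caveat: because this is an inner join, id pairs with nothing in common (count 0, such as id12/id21) won't appear in the result. A sketch that keeps the zero rows, assuming the same temp views as above:

spark.sql(
  """SELECT i1.id1, i2.id2, count(m.id1) AS count_similarity
   | FROM (SELECT DISTINCT id1 FROM temp_table_1) AS i1
   | CROSS JOIN (SELECT DISTINCT id2 FROM temp_table_2) AS i2
   | LEFT JOIN (SELECT t1.id1, t2.id2
   |            FROM temp_table_1 AS t1
   |            JOIN temp_table_2 AS t2 ON (t1.food = t2.food)) AS m
   |   ON m.id1 = i1.id1 AND m.id2 = i2.id2
   | GROUP BY i1.id1, i2.id2
   | ORDER BY i1.id1, i2.id2""".stripMargin
).show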
