I am new to Spark with Scala and I want to compute a similarity variable from two DataFrames or RDDs. They have no common key, so I did a cartesian join, but the joined DataFrame is huge. Is it possible to compute a new variable from both DataFrames without joining them?
For example:
df1.show
+----+------------+------+
| id1|        food| level|
+----+------------+------+
|id11|       pasta| first|
|id11|       pizza|second|
|id11|   ice cream| first|
|id12|     spanish| first|
|id12|   ice cream|second|
|id13|      fruits| first|
+----+------------+------+
df2.show
+----+---------+
| id2|     food|
+----+---------+
|id21|    pizza|
|id21|   fruits|
|id22|    pasta|
|id22|    pizza|
|id22|ice cream|
+----+---------+
For each id1 in df1, I want to compare its foods with the foods in df2, grouped by id2, and count how many they have in common.
I want to get this output:
+----+----+----------------+
| id1| id2|count_similarity|
+----+----+----------------+
|id11|id21|               1|  <- id11 and id21 have only "pizza" in common
|id11|id22|               3|
|id12|id21|               0|
|id12|id22|               1|
|id13|id21|               1|
|id13|id22|               0|
+----+----+----------------+
Is it possible to compute this using a map operation on an RDD? Thank you.
You can convert both data frames to RDDs, use the cartesian method to calculate the similarity between each id pair, and then reconstruct the data frame:
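// toDF below needs import spark.implicits._ (already in scope in spark-shell)
case class Similarity(id1: String, id2: String, count_similarity: Int)
// Collect the foods of each id into one list per id
val rdd1 = df1.rdd.groupBy(_.getString(0)).mapValues(_.map(_.getString(1)).toList)
val rdd2 = df2.rdd.groupBy(_.getString(0)).mapValues(_.map(_.getString(1)).toList)
// Pair every id1 with every id2 and count the foods they share
rdd1.cartesian(rdd2).map {
  case ((id1, foods1), (id2, foods2)) => Similarity(id1, id2, foods1.intersect(foods2).size)
}.toDF.orderBy("id1").show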
+----+----+----------------+
| id1| id2|count_similarity|
+----+----+----------------+
|id11|id22|               3|
|id11|id21|               1|
|id12|id21|               0|
|id12|id22|               1|
|id13|id21|               1|
|id13|id22|               0|
+----+----+----------------+
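The same idea (group the foods per id first, then pair up only the grouped rows) can probably be expressed without dropping to RDDs. The following is a minimal sketch, not part of the answer above, and it assumes Spark 2.4 or later because it relies on array_intersect. The sample data is copied from the question; the column names foods1 and foods2 are made up for illustration. Note that collect_set removes duplicate foods within an id, whereas List.intersect in the RDD version counts repeated foods.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_intersect, col, collect_set, size}

// Outside spark-shell a session has to be created; inside spark-shell
// `spark` and its implicits already exist and these two lines can be dropped.
val spark = SparkSession.builder().master("local[*]").appName("similarity-sketch").getOrCreate()
import spark.implicits._

// Sample data copied from the question.
val df1 = Seq(
  ("id11", "pasta", "first"), ("id11", "pizza", "second"), ("id11", "ice cream", "first"),
  ("id12", "spanish", "first"), ("id12", "ice cream", "second"), ("id13", "fruits", "first")
).toDF("id1", "food", "level")
val df2 = Seq(
  ("id21", "pizza"), ("id21", "fruits"),
  ("id22", "pasta"), ("id22", "pizza"), ("id22", "ice cream")
).toDF("id2", "food")

// Collapse each id to a single row holding its set of foods.
val foods1 = df1.groupBy("id1").agg(collect_set("food").as("foods1"))
val foods2 = df2.groupBy("id2").agg(collect_set("food").as("foods2"))

// Cross join the grouped rows only, then count the shared foods per pair.
val similarity = foods1.crossJoin(foods2)
  .select(col("id1"), col("id2"),
    size(array_intersect(col("foods1"), col("foods2"))).as("count_similarity"))

similarity.orderBy("id1", "id2").show()
```

The cross join here runs over one row per id rather than one row per food, so the intermediate result should be much smaller than a cartesian join of the raw data frames. Pairs with nothing in common still appear, with a count of 0.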
Would this work for you?
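// registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it
df1.createOrReplaceTempView("temp_table_1")
df2.createOrReplaceTempView("temp_table_2")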
spark.sql(
"""SELECT id1, id2, count(*) AS count_similarity FROM temp_table_1 AS t1
| JOIN temp_table_2 AS t2 ON (t1.food = t2.food)
| GROUP BY id1, id2
| ORDER BY id1, id2""".stripMargin
).show
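One caveat with an inner join on food: pairs that share no food at all (id12/id21 and id13/id22 in the sample data) produce no rows, so they are missing from the result instead of showing a count of 0. If those rows are needed, one possible workaround is to left join the counts onto every combination of distinct ids. The sketch below uses the DataFrame API; similarityWithZeros is a hypothetical helper name, and it assumes the column names id1, id2 and food from the question.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{count, lit}

// Hypothetical helper (not from the original answer): same equi-join on food,
// plus a left join against every (id1, id2) pair so that zero counts are kept.
def similarityWithZeros(df1: DataFrame, df2: DataFrame): DataFrame = {
  // Pairs sharing at least one food, with the number of matching rows.
  val matches = df1.join(df2, Seq("food"))
    .groupBy("id1", "id2")
    .agg(count(lit(1)).as("count_similarity"))

  // Every combination of distinct ids; usually much smaller than a
  // cartesian join of the raw rows.
  val allPairs = df1.select("id1").distinct().crossJoin(df2.select("id2").distinct())

  // Pairs without a match get null from the left join; replace it with 0.
  allPairs.join(matches, Seq("id1", "id2"), "left")
    .na.fill(0L, Seq("count_similarity"))
    .orderBy("id1", "id2")
}
```

Usage would be similarityWithZeros(df1, df2).show. Like the SQL version, this counts matching rows, so a food repeated within one id is counted more than once.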