简体   繁体   中英

Spark scala join RDD between 2 datasets

Supposed i have two dataset as following:

Dataset 1:

id, name, score
1, Bill, 200
2, Bew, 23
3, Amy, 44
4, Ramond, 68

Dataset 2:

id,message
1, i love Bill
2, i hate Bill
3, Bew go go !
4, Amy is the best
5, Ramond is the wrost
6, Bill go go
7, Bill i love ya
8, Ramond is Bad
9, Amy is great

I wanted to join above two datasets and counting the top number of person's name that appears in dataset2 according to the name in dataset1 the result should be:

Bill, 4
Ramond, 2 
..
..

I managed to join both of them together but not sure how to count how many time it appear for each person.

Any suggestion would be appreciated.

Edited: my join code:

val rdd = sc.textFile("dataset1")
val rdd2 = sc.textFile("dataset2")
val rddPair1 = rdd.map { x =>
  var data = x.split(",")
  new Tuple2(data(0), data(1))
}
val rddPair2 = rdd2.map { x =>
  var data = x.split(",")
  new Tuple2(data(0), data(1))
}
rddPair1.join(rddPair2).collect().foreach(f =>{
  println(f._1+" "+f._2._1+" "+f._2._2)
})

Using RDDs , achieving the solution you desire, would be complex. Not so much using dataframes .

First step would be to read the two files you have into dataframes as below

val df1 = sqlContext.read.format("com.databricks.spark.csv")
    .option("header", true)
  .load("dataset1")
val df2 = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", true)
  .load("dataset1")

so that you should be having

df1
+---+------+-----+
|id |name  |score|
+---+------+-----+
|1  |Bill  |200  |
|2  |Bew   |23   |
|3  |Amy   |44   |
|4  |Ramond|68   |
+---+------+-----+

df2
+---+-------------------+
|id |message            |
+---+-------------------+
|1  |i love Bill        |
|2  |i hate Bill        |
|3  |Bew go go !        |
|4  |Amy is the best    |
|5  |Ramond is the wrost|
|6  |Bill go go         |
|7  |Bill i love ya     |
|8  |Ramond is Bad      |
|9  |Amy is great       |
+---+-------------------+

join , groupBy and count should give your desired output as

df1.join(df2, df2("message").contains(df1("name")), "left").groupBy("name").count().as("count").show(false)

Final output would be

+------+-----+
|name  |count|
+------+-----+
|Ramond|2    |
|Bill  |4    |
|Amy   |2    |
|Bew   |1    |
+------+-----+

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM