
How to match a Spark RDD with results from a JdbcRDD

I have a mydata: RDD[Vertex] where each Vertex has an id property. I also have data from MySQL in a dbdata: JdbcRDD[String], where the first field of each comma-separated string is an id value that matches Vertex.id in the RDD collection.

I want to fill each Vertex's fields with data from the JdbcRDD[String], but I don't know the most efficient way to match the two datasets. I am trying this, but it's not working:

mydata.map { x =>
  val arow = dbdata.filter { y => y.split(",")(0) == x.id }.collect()
  x.prop1 = arow(1)
  // ... for all other values in arow
}

However, I get the following exception:

org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations

so I guess my approach is wrong...

Your problem sounds like a good fit for a join operation. The following code snippet should do the trick:

val mydata: RDD[Vertex] = ...
val dbdata: JdbcRDD[String] = ...

val mydataKV = mydata.map(x => (x.id, x))
val dbdataKV = dbdata.map(x => (x.split(",")(0), x))

val result = mydataKV.join(dbdataKV).mapValues {
  case (vertex, db) =>
    // fill vertex fields
    vertex.prop1 = db.split(",")(1)
    // ... and so on for the remaining fields

    vertex
}
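
For reference, here is a minimal, self-contained sketch of the whole pipeline. The Vertex class, the local SparkContext, and the sample rows are hypothetical stand-ins for your actual data; it also splits each database row once up front, so the fields don't have to be re-parsed for every property assignment:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for the asker's Vertex type, with a mutable field.
class Vertex(val id: String, var prop1: String = "") extends Serializable

object JoinExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("join-example").setMaster("local[*]"))

    // Sample data: mydata would come from your graph,
    // dbdata from a JdbcRDD[String] of comma-separated rows.
    val mydata = sc.parallelize(Seq(new Vertex("1"), new Vertex("2")))
    val dbdata = sc.parallelize(Seq("1,foo", "2,bar"))

    // Key the vertices by id.
    val mydataKV = mydata.map(v => (v.id, v))

    // Split each database row once, keyed by its first (id) field.
    val dbdataKV = dbdata.map { row =>
      val fields = row.split(",")
      (fields(0), fields)
    }

    // Join on the id and copy the database fields onto the vertex.
    val result = mydataKV.join(dbdataKV).mapValues { case (vertex, fields) =>
      vertex.prop1 = fields(1)
      // ... assign any remaining fields here
      vertex
    }

    result.values.collect().foreach(v => println(s"${v.id} -> ${v.prop1}"))
    sc.stop()
  }
}

Because the join runs entirely on the driver-defined RDD lineage, this avoids the nested-RDD problem in your original attempt: no transformation is invoked inside another transformation.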
