In the context of the telecom industry, let's suppose we have several existing RDDs populated from some tables in Cassandra:
val callPrices: RDD[PriceRow]
val calls: RDD[CallRow]
val offersInCourse: RDD[OfferRow]
where the types are defined as follows:
/** Represents the price per minute for a concrete hour */
case class PriceRow(
val year: Int,
val month: Int,
val day: Int,
val hour: Int,
val basePrice: Float)
/** Call records */
case class CallRow(
val customer: String,
val year: Int,
val month: Int,
val day: Int,
val hour: Int,
val minutes: Int)
/** Is there any discount that could be applicable here? */
case class OfferRow(
val offerName: String,
val hour: Int,        // [0..23]
val discount: Float)  // [0..1]
Assuming we cannot use flatMap (i.e. a for-comprehension) to combine these three RDDs like this (since RDD is not really monadic),
/**
* The final bill at a concrete hour for a call
* is defined as {{{
* def billPerHour(minutes: Int,basePrice:Float,discount:Float) =
* minutes * basePrice * discount
* }}}
*/
val bills: RDD[BillRow] = for {
  price <- callPrices
  call  <- calls if call.hour == price.hour
  offer <- offersInCourse if offer.hour == price.hour
} yield BillRow(
  call.customer,
  call.hour,
  billPerHour(call.minutes, price.basePrice, offer.discount))
case class BillRow(
val customer: String,
val hour: Int,
val amount: Float)
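For completeness, the billing helper from the doc comment above, written out as a standalone function (as given, the discount factor simply multiplies the base price):

```scala
/** As defined in the doc comment: the discount acts as a multiplier on the price */
def billPerHour(minutes: Int, basePrice: Float, discount: Float): Float =
  minutes * basePrice * discount
```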
What is the best practice for generating a new RDD that joins these three RDDs and represents the bill for a concrete customer?
In the end, I used the approach that Daniel suggested to me on the Spark mailing list.
I resolved it as follows:
type Key = (Int,Int,Int,Int)
type Price = Float
type CustomerCall = (String,Int)
type Offer = (String,Float)
val keyedCallPricesRdd: RDD[(Key,Price)] = callPrices.map{
case PriceRow(year,month,day,hour,basePrice) =>
((year,month,day,hour),basePrice)
}
val keyedCallsRdd: RDD[(Key,CustomerCall)] = calls.map{
case CallRow(customer,year,month,day,hour,minutes) =>
((year,month,day,hour),(customer,minutes))
}
val keyedOffersRdd: RDD[(Key,Offer)] = for{
offer <- offersInCourse
year <- List(2013,2014) //possible years I want to calculate
month <- 1 to 12
day <- 1 to 31
} yield ((year,month,day,offer.hour),(offer.offerName,offer.discount))
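Note that iterating `day <- 1 to 31` for every month also generates keys for non-existent dates (e.g. February 31); those keys simply never match anything in the join, so the result is still correct. If you prefer to emit only valid dates, here is a sketch using `java.time` validation (the guard is the only change from the version above):

```scala
import java.time.LocalDate
import scala.util.Try

val keyedOffersRdd: RDD[(Key, Offer)] = for {
  offer <- offersInCourse
  year  <- List(2013, 2014)           // possible years I want to calculate
  month <- 1 to 12
  day   <- 1 to 31
  if Try(LocalDate.of(year, month, day)).isSuccess // skip e.g. Feb 31
} yield ((year, month, day, offer.hour), (offer.offerName, offer.discount))
```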
import org.apache.spark.SparkContext._
keyedCallPricesRdd
.join(keyedCallsRdd)
.join(keyedOffersRdd)
.map {
  // two chained joins nest the values as ((Price, CustomerCall), Offer)
  case (key: Key, ((price, call), offer)) =>
    //do whatever you need...
}
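Putting it together: two chained joins yield pairs of shape `(Key, ((Price, CustomerCall), Offer))`, so the final map can build the bill rows directly. A sketch, assuming `BillRow.hour` is an `Int` and `billPerHour` as defined in the question:

```scala
import org.apache.spark.SparkContext._

val bills: RDD[BillRow] =
  keyedCallPricesRdd
    .join(keyedCallsRdd)   // RDD[(Key, (Price, CustomerCall))]
    .join(keyedOffersRdd)  // RDD[(Key, ((Price, CustomerCall), Offer))]
    .map {
      case ((_, _, _, hour), ((basePrice, (customer, minutes)), (_, discount))) =>
        BillRow(customer, hour, billPerHour(minutes, basePrice, discount))
    }
```

If several offers apply to the same hour, the join produces one row per matching offer; deduplicate or pick the best discount beforehand if that is not what you want.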