
Joining not-pair RDDs in Spark

In the context of the telecom industry, let's suppose we have several existing RDDs populated from some tables in Cassandra:

val callPrices: RDD[PriceRow]
val calls: RDD[CallRow]
val offersInCourse: RDD[OfferRow]
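
For context, RDDs like these could come straight out of Cassandra via the spark-cassandra-connector. A minimal sketch of one of them (the keyspace and table names here are made up, and sc is an existing SparkContext):

import com.datastax.spark.connector._

// Maps rows of the hypothetical telecom.call_prices table onto PriceRow;
// column names are expected to match the case class fields.
val callPrices: RDD[PriceRow] =
    sc.cassandraTable[PriceRow]("telecom", "call_prices")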

where types are defined as follows,

/** Represents the price per minute for a concrete hour */
case class PriceRow(
    val year: Int,
    val month: Int,
    val day: Int,
    val hour: Int,
    val basePrice: Float)

/** Call records */
case class CallRow(
    val customer: String,
    val year: Int,
    val month: Int,
    val day: Int,
    val hour: Int,
    val minutes: Int)

/** Is there any discount that could be applicable here? */
case class OfferRow(
    val offerName: String,
    val hour: Int,//[0..23]
    val discount: Float)//[0..1]

Assuming we cannot use flatMap to combine these three RDDs like this (since RDD is not really 'monadic'),

/** 
 * The final bill at a concrete hour for a call 
 * is defined as {{{ 
 *    def billPerHour(minutes: Int,basePrice:Float,discount:Float) = 
 *      minutes * basePrice * discount
 * }}}
 */
val bills: RDD[BillRow] = for{
    price <- callPrices
    call <- calls if call.hour==price.hour
    offer <- offersInCourse if offer.hour==price.hour
} yield BillRow(
    call.customer,
    call.hour,
    billPerHour(call.minutes,price.basePrice,offer.discount))

case class BillRow(
    val customer: String,
    val hour: Int, // matches CallRow.hour
    val amount: Float)
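
Here billPerHour is just the helper from the scaladoc above, written out as a standalone definition:

def billPerHour(minutes: Int, basePrice: Float, discount: Float): Float =
    minutes * basePrice * discount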

what is the best practice for generating a new RDD that joins these three RDDs and represents the bill for a concrete customer?

In the end I used the approach that Daniel suggested to me on the Spark mailing list.

So I resolved it as follows:

type Key = (Int,Int,Int,Int)     // (year, month, day, hour)
type Price = Float               // PriceRow.basePrice is a Float
type CustomerCall = (String,Int) // (customer, minutes)
type Offer = (String,Float)      // (offerName, discount)

val keyedCallPricesRdd: RDD[(Key,Price)] = callPrices.map{
    case PriceRow(year,month,day,hour,basePrice) =>
        ((year,month,day,hour),basePrice)
}
val keyedCallsRdd: RDD[(Key,CustomerCall)] = calls.map{
    case CallRow(customer,year,month,day,hour,minutes) =>
        ((year,month,day,hour),(customer,minutes))
}
// Expand every offer to all (year, month, day, hour) keys so it can be joined
// on the full key; non-existent dates (e.g. February 31) simply never match a call.
val keyedOffersRdd: RDD[(Key,Offer)] = for{
    offer <- offersInCourse
    year <- List(2013,2014) //possible years I want to calculate
    month <- 1 to 12
    day <- 1 to 31
} yield ((year,month,day,offer.hour),(offer.offerName,offer.discount))

import org.apache.spark.SparkContext._ // implicit conversions that add pair-RDD ops such as join

keyedCallPricesRdd
    .join(keyedCallsRdd)
    .join(keyedOffersRdd)
    .map {
        // two consecutive joins nest the values as ((Price, CustomerCall), Offer)
        case (key: Key, ((price: Price, call: CustomerCall), offer: Offer)) =>
            //do whatever you need...
    }
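
For instance, a minimal sketch of that final map that rebuilds BillRow values by destructuring the nested tuples (it reuses billPerHour from the question and takes the hour from the composite key):

val bills: RDD[BillRow] =
    keyedCallPricesRdd
        .join(keyedCallsRdd)
        .join(keyedOffersRdd)
        .map {
            // key is (year, month, day, hour); value is ((basePrice, (customer, minutes)), offer)
            case ((_, _, _, hour), ((basePrice, (customer, minutes)), (_, discount))) =>
                BillRow(customer, hour, billPerHour(minutes, basePrice, discount))
        }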
