How to calculate the time difference between two records using Scala?
I want to calculate the time difference between the events of a session using Scala.
Given: the source is a CSV file that looks like this:
HEADER
"session","events","timestamp","Records"
DATA
"session_1","event_1","2015-01-01 10:10:00",100
"session_1","event_2","2015-01-01 11:00:00",500
"session_1","event_3","2015-01-01 11:30:00",300
"session_1","event_4","2015-01-01 11:45:00",300
"session_2","event_1","2015-01-01 10:10:00",100
"session_2","event_2","2015-01-01 11:00:00",500
Desired output
HEADER
"session","events","time_spent_in_minutes","total_records"
DATA
"session_1","event_1","50",100
"session_1","event_2","30",600
"session_1","event_3","15",900
"session_1","event_4","0",1200
"session_2","event_1","50",100
"session_2","event_2","0",600
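Here time_spent_in_minutes is the difference between the current event and the next event of the same session (0 for the last event of a session), and total_records is the running total of Records within the session. The header is not required in the output, but it would be nice to have.
I am new to Scala, so here is what I have so far: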
$ cat test.csv
"session_1","event_1","2015-01-01 10:10:00",100
"session_1","event_2","2015-01-01 11:00:00",500
"session_1","event_3","2015-01-01 11:30:00",300
"session_1","event_4","2015-01-01 11:45:00",300
"session_2","event_1","2015-01-01 10:10:00",100
"session_2","event_2","2015-01-01 11:00:00",500
scala> val sessionFile = sc.textFile("test.csv").
map(_.split(',')).
map(e => (e(1).trim, Sessions(e(0).trim,e(1).trim,e(2).trim,e(3).trim.toInt))).
foreach(println)
("event_1",Sessions("session_2","event_1","2015-01-01 10:10:00",100))
("event_1",Sessions("session_1","event_1","2015-01-01 10:10:00",100))
("event_2",Sessions("session_2","event_2","2015-01-01 11:00:00",500))
("event_2",Sessions("session_1","event_2","2015-01-01 11:00:00",500))
("event_3",Sessions("session_1","event_3","2015-01-01 11:30:00",300))
("event_4",Sessions("session_1","event_4","2015-01-01 11:45:00",300))
sessionFile: Unit = ()
scala>
Here is a solution using the Joda-Time library.
val input =
""""session_1","event_1","2015-01-01 10:10:00",100
"session_1","event_2","2015-01-01 11:00:00",500
"session_1","event_3","2015-01-01 11:30:00",300
"session_1","event_4","2015-01-01 11:45:00",300
"session_2","event_1","2015-01-01 10:10:00",100
"session_2","event_2","2015-01-01 11:00:00",500"""
Create an RDD from the text input. To read from a file instead, use sc.textFile.
import org.joda.time.format._
import org.joda.time._
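// The pattern deliberately includes the double quotes, because each timestamp
// field still carries its CSV quotes after the split on ",".
def strToTime(s: String): Long = {
  DateTimeFormat.forPattern(""""yyyy-MM-dd HH:mm:ss"""")
    .parseDateTime(s).getMillis() / 1000  // milliseconds to seconds
}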
val r1 = sc.parallelize(input.split("\n"))
  .map(_.split(","))
  .map(x => (x(0), (x(1), x(2), x(3))))  // (session, (event, timestamp, records))
  .groupBy(_._1)                          // group the rows by session
  .map(_._2.map { case (s, (e, timestr, r)) =>
      (s, (e, strToTime(timestr), r)) }
    .toArray
    .sortBy { case (session, (event, time, records)) => time })
This converts each time from "2015-01-01 10:10:00" to seconds since the epoch and sorts the events of each session by time.
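For readers who prefer not to pull in Joda-Time, the same parse-and-sort step can be sketched with java.time and plain Scala collections. This is only an illustration, not the code used in this answer: the helper name toEpochSeconds is made up, and it assumes the timestamps are in UTC, whereas the Joda version above uses the default time zone.

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

val timestampFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// Hypothetical helper: strips the surrounding CSV quotes, then converts the
// timestamp to seconds since the epoch.
// Assumption: the timestamps have no zone, so they are read as UTC here.
def toEpochSeconds(quoted: String): Long =
  LocalDateTime
    .parse(quoted.trim.stripPrefix("\"").stripSuffix("\""), timestampFormat)
    .toEpochSecond(ZoneOffset.UTC)

// Timestamp fields as they look after splitting a CSV row, in arbitrary order.
val unsorted = Seq(
  "\"2015-01-01 11:30:00\"",
  "\"2015-01-01 10:10:00\"",
  "\"2015-01-01 11:00:00\""
)

// Convert to epoch seconds and sort in ascending order.
val sortedSeconds = unsorted.map(toEpochSeconds).sorted
```

Working with plain Long values keeps the later subtraction trivial; the time zone only matters if the data crosses a daylight-saving change.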
val r2 = r1.map(x => x :+ {
  x.last match {
    case (session, (event, time, records)) =>
      (session, (event, time, "0"))
  }
})
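This appends an extra event to each session whose fields are identical to the session's last event, except for the record count. It lets the duration calculation yield "0" for the last event.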
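The padding trick may be easier to see on a small in-memory collection. The sketch below is an illustration only: the case class Event is hypothetical (the answer itself uses nested tuples), and the epoch values assume UTC timestamps.

```scala
// Hypothetical record type for illustration; the answer uses nested tuples.
case class Event(session: String, name: String, epochSeconds: Long, records: Int)

// The two events of session_2, already sorted by time (UTC assumed).
val events = Vector(
  Event("session_2", "event_1", 1420107000L, 100), // 2015-01-01 10:10:00
  Event("session_2", "event_2", 1420110000L, 500)  // 2015-01-01 11:00:00
)

// Duplicate the last event with a zero record count, so that the final real
// event has a "next" event carrying the same timestamp.
val padded = events :+ events.last.copy(records = 0)
```

Because the sentinel has the same timestamp as the last real event, the difference computed for that event is zero, and the sentinel itself never shows up in the output.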
Use sliding to get pairs of consecutive events.
val r3 = r2.map(x => x.sliding(2).toArray)
val r4 = r3.map(x => x.map{
case Array((s1, (e1, t1, c1)), (s2, (e2, t2, c2))) =>
(s1, (e1, (t2 - t1)/60, c1)) } )
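To see what sliding(2) contributes, here is a minimal sketch on an ordinary Vector rather than an RDD. The times are written as seconds since midnight purely for readability, which is an assumption of this example and not how the answer stores them.

```scala
// (event, seconds since midnight) for session_1, already sorted by time.
val times = Vector(
  ("event_1", 36600L), // 10:10:00
  ("event_2", 39600L), // 11:00:00
  ("event_3", 41400L), // 11:30:00
  ("event_4", 42300L)  // 11:45:00
)

// Repeat the last event so that it also gets a successor.
val withSentinel = times :+ times.last

// sliding(2) yields every pair of neighbouring events; the difference of the
// two times, divided by 60, is the time spent in minutes.
val minutesSpent = withSentinel.sliding(2).collect {
  case Seq((event, t1), (_, t2)) => (event, (t2 - t1) / 60)
}.toVector
```

Each window pairs an event with its successor, so the result holds one entry per real event and none for the sentinel.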
Use scan to add up the record counts incrementally.
val r5 = r4.map(x => x.zip(x.map { case (s, (e, t, r)) => r.toInt }
  .scan(0)(_ + _)
  .drop(1)))
val r6 = r5.map(x => x.map { case ((s, (e, t, r)), recordstillnow) =>
  s"${s},${e},${t},${recordstillnow}" })
val r7 = r6.flatMap(x => x)
r7.collect.mkString("\n")
//"session_2","event_1",50,100
//"session_2","event_2",0,600
//"session_1","event_1",50,100
//"session_1","event_2",30,600
//"session_1","event_3",15,900
//"session_1","event_4",0,1200
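The running-total step can also be looked at in isolation. A minimal sketch on a plain Vector, using scanLeft (which fixes the left-to-right order that scan leaves unspecified) with the record counts of session_1 from the question:

```scala
// Record counts of session_1, in event order.
val records = Vector(100, 500, 300, 300)

// scanLeft emits the start value followed by every partial sum;
// tail drops the leading 0 so there is one total per event.
val runningTotals = records.scanLeft(0)(_ + _).tail

// Pair each event name with the total number of records seen so far.
val eventNames = Vector("event_1", "event_2", "event_3", "event_4")
val totalsByEvent = eventNames.zip(runningTotals)
```

The totals line up with the total_records column of the desired output for session_1.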
Try something like this:
import org.joda.time.format._
import org.joda.time._
val d1 = DateTime.parse("2015-03-03", DateTimeFormat.forPattern("yyyy-MM-dd"))
val d2 = DateTime.parse("2015-03-04", DateTimeFormat.forPattern("yyyy-MM-dd"))
d2.getMillis() - d1.getMillis()  // elapsed milliseconds from d1 to d2
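If Joda-Time is not available, java.time (Java 8 and later) offers a similar route. This is a sketch under that assumption rather than part of the answer above; it uses two timestamps from the question and skips the quote stripping.

```scala
import java.time.{Duration, LocalDateTime}
import java.time.format.DateTimeFormatter

val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// Two consecutive events of session_1 from the question.
val start = LocalDateTime.parse("2015-01-01 10:10:00", fmt)
val end   = LocalDateTime.parse("2015-01-01 11:00:00", fmt)

// Duration.between measures the elapsed time; toMinutes truncates to whole minutes.
val minutesBetween = Duration.between(start, end).toMinutes
```

Duration.between returns a negative duration when the arguments are swapped, so the earlier timestamp goes first.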