spark rdd: grouping and filtering

I have an RDD of objects, "labResults":

case class LabResult(patientID: String, date: Long, labName: String, value: String)

I want to transform this RDD so that it contains only one row per patientID-and-labName combination. This row should be the last row for that patientID-and-labName combination (I am only interested in the latest date on which the patient had this lab). I do it this way:

//group rows by patient and lab and take only the last one
val cleanLab = labResults.groupBy(x => (x.patientID, x.labName)).map(_._2).map { events =>
  val latest_date = events.maxBy(_.date)
  val lab = events.filter(x=> x.date == latest_date)
  lab.take(1)
}

Next, I want to create edges from this RDD:

val edgePatientLab: RDD[Edge[EdgeProperty]] = cleanLab
  .map({ lab =>
    Edge(lab.patientID.toLong, lab2VertexId(lab.labName), PatientLabEdgeProperty(lab).asInstanceOf[EdgeProperty])
  })

I get an error:

value patientID is not a member of Iterable[edu.gatech.cse6250.model.LabResult]

[error]     Edge(lab.patientID.toLong, lab2VertexId(lab.labName), PatientLabEdgeProperty(lab).asInstanceOf[EdgeProperty])
[error]              ^
[error] /hw4/stu_code/src/main/scala/edu/gatech/cse6250/graphconstruct/GraphLoader.scala:94:53: value labName is not a member of Iterable[edu.gatech.cse6250.model.LabResult]
[error]     Edge(lab.patientID.toLong, lab2VertexId(lab.labName), PatientLabEdgeProperty(lab).asInstanceOf[EdgeProperty])
[error]                                                  ^
[error] /hw4/stu_code/src/main/scala/edu/gatech/cse6250/graphconstruct/GraphLoader.scala:94:86: type mismatch;
[error]  found   : Iterable[edu.gatech.cse6250.model.LabResult]
[error]  required: edu.gatech.cse6250.model.LabResult
[error]     Edge(lab.patientID.toLong, lab2VertexId(lab.labName), PatientLabEdgeProperty(lab).asInstanceOf[EdgeProperty])

So the problem seems to be that "cleanLab" is not an RDD of LabResult as I expected, but an RDD of Iterable[edu.gatech.cse6250.model.LabResult].

How can I fix this?
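(For reference, a minimal pure-RDD sketch of a fix, assuming the same LabResult case class as above; reduceByKey here replaces the groupBy/filter chain, so each (patientID, labName) key keeps exactly one LabResult and the result is an RDD[LabResult] rather than an RDD[Iterable[LabResult]]:)

val cleanLab: RDD[LabResult] = labResults
  .map(lr => ((lr.patientID, lr.labName), lr))            // key by (patientID, labName)
  .reduceByKey((a, b) => if (a.date >= b.date) a else b)  // keep the latest result per key
  .map(_._2)                                              // drop the key, leaving one LabResult per combination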

Here is my approach for the first part. As for Edge and the other classes, I don't know where they come from (is it from here?)

scala> val ds = List(("1", 1, "A", "value 1"), ("1", 3, "A", "value 3"), ("1", 3, "B", "value 3"), ("1", 2, "A", "value 2"), ("1", 3, "B", "value 3"), ("1", 5, "B", "value 5") ).toDF("patientID", "date", "labName", "value").as[LabResult]
ds: org.apache.spark.sql.Dataset[LabResult] = [patientID: string, date: int ... 2 more fields]

scala> ds.show
+---------+----+-------+-------+
|patientID|date|labName|  value|
+---------+----+-------+-------+
|        1|   1|      A|value 1|
|        1|   3|      A|value 3|
|        1|   3|      B|value 3|
|        1|   2|      A|value 2|
|        1|   3|      B|value 3|
|        1|   5|      B|value 5|
+---------+----+-------+-------+


scala> val grouped = ds.groupBy("patientID", "labName").agg(max("date") as "date")
grouped: org.apache.spark.sql.DataFrame = [patientID: string, labName: string ... 1 more field]

scala> grouped.show
+---------+-------+----+
|patientID|labName|date|
+---------+-------+----+
|        1|      A|   3|
|        1|      B|   5|
+---------+-------+----+


scala> val cleanLab = ds.join(grouped, Seq("patientID", "labName", "date")).as[LabResult]
cleanLab: org.apache.spark.sql.Dataset[LabResult] = [patientID: string, labName: string ... 2 more fields]

scala> cleanLab.show
+---------+-------+----+-------+
|patientID|labName|date|  value|
+---------+-------+----+-------+
|        1|      A|   3|value 3|
|        1|      B|   5|value 5|
+---------+-------+----+-------+


scala> cleanLab.head
res45: LabResult = LabResult(1,3,A,value 3)

scala>
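To connect this back to the edge construction in the question: a Dataset can be converted back to an RDD with .rdd, so (a sketch, assuming the lab2VertexId helper and the PatientLabEdgeProperty and EdgeProperty classes from the question) the edges could then be built from one LabResult per row:

// cleanLab.rdd yields an RDD[LabResult], so each `lab` below is a single row,
// not an Iterable, and lab.patientID / lab.labName resolve as expected
val edgePatientLab: RDD[Edge[EdgeProperty]] = cleanLab.rdd.map { lab =>
  Edge(lab.patientID.toLong, lab2VertexId(lab.labName), PatientLabEdgeProperty(lab).asInstanceOf[EdgeProperty])
}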

