
Spark find previous value on each iteration of RDD

I have the following code:

val rdd = sc.cassandraTable("db", "table")
  .select("id", "date", "gpsdt")
  .where("id=? and date=? and gpsdt>? and gpsdt<?", entry(0), entry(1), entry(2), entry(3))
val rddcopy = rdd.sortBy(row => row.get[String]("gpsdt"), false).zipWithIndex()
rddcopy.foreach { record =>
  // pseudocode: how do I get the row before this one?
  val previousRow = ??? // the (index - 1)th row
  val currentRow = record
  // some calculation based on both rows
}

So the idea is to get the previous/next row on each iteration of the RDD. I want to calculate a field on the current row based on the value present in the previous row. Thanks,

EDIT II: The answer below misunderstood the question: it shows how to get tumbling-window semantics, but a sliding window is needed. Considering this is a sorted RDD,

import org.apache.spark.mllib.rdd.RDDFunctions._
sortedRDD.sliding(2)

should do the trick. Note however that this uses a Developer API.
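For example, here is a minimal sketch of applying sliding(2) to the question's data; the speed column and the delta computation are illustrative assumptions, not part of the original schema:

import org.apache.spark.mllib.rdd.RDDFunctions._
import org.apache.spark.rdd.RDD

// Reduce each Cassandra row to (gpsdt, speed); "speed" is a hypothetical column.
val sortedRdd: RDD[(String, Double)] = rdd
  .map(row => (row.get[String]("gpsdt"), row.get[Double]("speed")))
  .sortBy(_._1, ascending = false)

// Each window is Array(previous, current) in the sorted order.
val deltas = sortedRdd.sliding(2).map { window =>
  val (prevGpsdt, prevSpeed) = window(0)
  val (currGpsdt, currSpeed) = window(1)
  (currGpsdt, currSpeed - prevSpeed) // e.g. the change relative to the previous row
}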

Alternatively you can

val l = sortedRdd.zipWithIndex.map(kv => (kv._2, kv._1))     // (index, row)
val r = sortedRdd.zipWithIndex.map(kv => (kv._2 - 1, kv._1)) // (index - 1, row)
val sliding = l.join(r)                                      // pairs row i with row i + 1

RDD joins are inner joins (IIRC), thus dropping the edge cases where the tuples would be partially null (the first and last rows, which have no neighbor on one side).
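As a usage sketch: for each key i, l contributed the element at index i and r the element at index i + 1, so each joined value is a (previous, current) pair:

// sliding: RDD[(Long, (Row, Row))]; zipWithIndex indices are Long
sliding.foreach { case (i, (previousRow, currentRow)) =>
  // some calculation based on both rows
}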

OLD STUFF:

How do you identify the previous row? RDDs do not have any sort of stable ordering by themselves. If you have an incrementing dense key, you could add a new column calculated as if (k % 2 == 0) k / 2 else (k - 1) / 2; this gives the same value for two successive keys (indices 0 and 1 map to 0, indices 2 and 3 map to 1, and so on). Then you can just group by that key.

But to reiterate: in most cases there is no really sensible notion of previous for RDDs (depending on partitioning, data source, etc.).

EDIT: so now that you have zipWithIndex and an ordering on your set, you can do what I mentioned above. You now have an RDD[(Long, YourData)] (zipWithIndex indices are Long) and can do

rdd.map(kv => if (kv._1 % 2 == 0) (kv._1 / 2, kv._2) else ((kv._1 - 1) / 2, kv._2)).groupByKey.foreach(/* your stuff here */)

If you reduce at any point, consider using reduceByKey rather than groupByKey().reduce.
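For instance, a sketch of combining each consecutive pair with reduceByKey, assuming the values are numeric and indexed is the (index, value) RDD from the EDIT above (note that with integer division, if (k % 2 == 0) k / 2 else (k - 1) / 2 simplifies to plain k / 2 for non-negative keys):

// indexed: RDD[(Long, Double)], keys first as in the EDIT above
val pairSums = indexed
  .map { case (k, v) => (k / 2, v) } // indices (0, 1) -> 0, (2, 3) -> 1, ...
  .reduceByKey(_ + _)                // combine the two rows of each pair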
