
Spark streaming example calls updateStateByKey with additional parameters

Wondering why the StatefulNetworkWordCount.scala example calls the infamous updateStateByKey() function, which is supposed to take only a function as its parameter, with several extra arguments instead:

val stateDstream = wordDstream.updateStateByKey[Int](newUpdateFunc,
  new HashPartitioner (ssc.sparkContext.defaultParallelism), true, initialRDD)

Why does it need to pass a partitioner, a boolean, and an RDD, and how do those get processed, given that they are not in the signature of updateStateByKey()?

thanks, Matt

It is because:

  1. You are looking at a different Spark release branch: https://github.com/apache/spark/blob/branch-1.3/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala . In Spark 1.2 this example passed just a single update function to updateStateByKey, while in 1.3 it was optimized.
  2. Different overloads of updateStateByKey exist in both 1.2 and 1.3, but there is no 4-parameter version in 1.2; it was introduced only in 1.3: https://github.com/apache/spark/blob/branch-1.3/streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala
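To make the difference concrete, here is a sketch of the simpler update function shape that the Spark 1.2 overload expects: a per-key function from the batch's new values and the previous state to the new state. The accumulation logic can be checked in plain Scala without a Spark cluster (the name `updateFunc` and the running-count logic are illustrative, not taken verbatim from the example):

```scala
// Spark 1.2 style: updateStateByKey takes only this per-key function.
// newValues: the values seen for a key in the current batch;
// runningCount: the previous state for that key, if any.
val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
  (newValues, runningCount) => Some(newValues.sum + runningCount.getOrElse(0))

// Plain-Scala check of the accumulation, no SparkContext required:
val first  = updateFunc(Seq(1, 2, 3), None)   // Some(6)
val second = updateFunc(Seq(2), first)        // Some(8)
```

Returning `None` from such a function would drop the key's state entirely, which is how stale keys are expired.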

Here is the code:

/**
* Return a new "state" DStream where the state for each key is updated by applying
* the given function on the previous state of the key and the new values of each key.
* org.apache.spark.Partitioner is used to control the partitioning of each RDD.
* @param updateFunc State update function. Note, that this function may generate a different
* tuple with a different key than the input key. Therefore keys may be removed
* or added in this way. It is up to the developer to decide whether to
* remember the partitioner despite the key being changed.
* @param partitioner Partitioner for controlling the partitioning of each RDD in the new
* DStream
* @param rememberPartitioner Whether to remember the partitioner object in the generated RDDs.
* @param initialRDD initial state value of each key.
* @tparam S State type
*/
def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean,
    initialRDD: RDD[(K, S)]
  ): DStream[(K, S)] = {
    new StateDStream(self, ssc.sc.clean(updateFunc), partitioner,
      rememberPartitioner, Some(initialRDD))
  }
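Note that this 4-parameter overload takes the update function in a different shape as well: instead of one call per key, it receives an iterator over all `(key, newValues, previousState)` triples of a partition, which lets Spark avoid per-key function-call overhead and lets the function rename or drop keys. A minimal sketch of that shape, checkable in plain Scala (the name `newUpdateFunc` and the word-count logic are illustrative):

```scala
// Shape required by the 1.3 4-parameter overload:
// Iterator[(K, Seq[V], Option[S])] => Iterator[(K, S)]
def newUpdateFunc(
    iterator: Iterator[(String, Seq[Int], Option[Int])]
): Iterator[(String, Int)] =
  iterator.map { case (word, counts, state) =>
    // New state = this batch's counts plus any previous running count.
    (word, counts.sum + state.getOrElse(0))
  }

val out = newUpdateFunc(Iterator(
  ("hello", Seq(1, 1), Some(3)),  // previously seen 3 times
  ("world", Seq(1), None)         // first appearance
)).toMap
// out == Map("hello" -> 5, "world" -> 1)
```

The extra arguments then make sense: the `Partitioner` controls how state RDDs are partitioned, the boolean says whether generated RDDs should remember that partitioner, and `initialRDD` seeds the state before the first batch.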
