
Batch lookup data for Spark streaming

I need to look up some data in a Spark Streaming job from a file on HDFS. This data is fetched once a day by a batch job.
Is there a "design pattern" for such a task?

  • how can I reload the data in memory (a hashmap) immediately after a daily update?
  • how to serve the streaming job continuously while this lookup data is being fetched?

One possible approach is to drop local data structures and use a stateful stream instead. Let's assume you have a main data stream called mainStream:

val mainStream: DStream[T] = ???

Next you can create another stream which reads the lookup data:

val lookupStream: DStream[(K, V)] = ???
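
How lookupStream gets its data depends on how the daily batch job publishes the file. One minimal sketch, assuming the batch job drops a new tab-separated key/value file into an HDFS directory (the path, file format, and String types are assumptions for illustration), is to let Spark Streaming monitor that directory with textFileStream:

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

val ssc: StreamingContext = ???  // the streaming context driving the job

// textFileStream picks up files newly created in the directory,
// so each daily dump becomes one batch of lookup records.
val lookupStream: DStream[(String, String)] =
  ssc.textFileStream("hdfs:///data/lookup/")
    .map { line =>
      val Array(k, v) = line.split("\t", 2)
      (k, v)
    }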

You also need a simple function which can be used to update the state:

def update(
  current: Seq[V],  // A sequence of values for a given key in the current batch
  prev: Option[V]   // Value for a given key from the previous state
): Option[V] = {
  current
    .headOption    // If the current batch is not empty take the first element
    .orElse(prev)  // If it is empty (None) keep the previous state
}
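
To make the semantics concrete, with String values the function behaves like this (purely illustrative):

update(Seq("new"), Some("old"))  // Some("new") - a value from the current batch wins
update(Seq.empty,  Some("old"))  // Some("old") - no new data, the previous state is kept
update(Seq.empty,  None)         // None        - the key has not been seen yet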

These two pieces can be used to create the state:

val state = lookupStream.updateStateByKey(update)
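
Note that updateStateByKey is a stateful transformation, so Spark Streaming requires checkpointing to be enabled; a one-line sketch (the HDFS path is an assumption):

// Stateful operations such as updateStateByKey need a checkpoint directory.
ssc.checkpoint("hdfs:///checkpoints/lookup-state")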

All that's left is to key mainStream and join the data:

def toPair(t: T): (K, T) = ???

mainStream.map(toPair).leftOuterJoin(state)
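
The result is a DStream[(K, (T, Option[V]))], where the Option[V] holds the latest lookup value seen for that key, or None if none has arrived yet. A minimal sketch of consuming the joined stream (the enrich function is hypothetical):

mainStream
  .map(toPair)
  .leftOuterJoin(state)
  .foreachRDD { rdd =>
    rdd.foreach { case (key, (event, lookupValue)) =>
      // lookupValue stays None until the key shows up in the lookup data
      enrich(event, lookupValue)  // hypothetical downstream processing
    }
  }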

While this is probably less than optimal from a performance point of view, it leverages the architecture which is already in place and frees you from manually dealing with invalidation or failure recovery.
