
Batch lookup data for Spark streaming

I need to look up some data in a Spark Streaming job from a file on HDFS. This data is fetched once a day by a batch job.
Is there a "design pattern" for such a task?

  • how can I reload the data in memory (a hashmap) immediately after a daily update?
  • how to serve the streaming job continuously while this lookup data is being fetched?

One possible approach is to drop local data structures and use a stateful stream instead. Let's assume you have a main data stream called mainStream:

val mainStream: DStream[T] = ???

Next you can create another stream which reads the lookup data:

val lookupStream: DStream[(K, V)] = ???
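
How lookupStream gets its data depends on how the daily batch job publishes the file. One minimal sketch, assuming the batch job drops a new tab-separated key/value file into an HDFS directory (the path, file format, and String types are assumptions for illustration), is to let Spark Streaming monitor that directory with textFileStream:

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

val ssc: StreamingContext = ???  // the streaming context driving the job

// textFileStream picks up files newly created in the directory,
// so each daily dump becomes one batch of lookup records.
val lookupStream: DStream[(String, String)] =
  ssc.textFileStream("hdfs:///data/lookup/")
    .map { line =>
      val Array(k, v) = line.split("\t", 2)
      (k, v)
    }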

You also need a simple function which can be used to update the state:

def update(
  current: Seq[V],  // A sequence of values for a given key in the current batch
  prev: Option[V]   // Value for a given key from the previous state
): Option[V] = {
  current
    .headOption    // If the current batch is not empty take the first element
    .orElse(prev)  // If it is empty (None) keep the previous state
}
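
To make the semantics concrete, with String values the function behaves like this (purely illustrative):

update(Seq("new"), Some("old"))  // Some("new") - a value from the current batch wins
update(Seq.empty,  Some("old"))  // Some("old") - no new data, the previous state is kept
update(Seq.empty,  None)         // None        - the key has not been seen yet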

These two pieces can be used to create the state:

val state = lookupStream.updateStateByKey(update)
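
Note that updateStateByKey is a stateful transformation, so Spark Streaming requires checkpointing to be enabled; a one-line sketch (the HDFS path is an assumption):

// Stateful operations such as updateStateByKey need a checkpoint directory.
ssc.checkpoint("hdfs:///checkpoints/lookup-state")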

All that's left is to key mainStream and join the data:

def toPair(t: T): (K, T) = ???

mainStream.map(toPair).leftOuterJoin(state)
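
The result is a DStream[(K, (T, Option[V]))], where the Option[V] holds the latest lookup value seen for that key, or None if none has arrived yet. A minimal sketch of consuming the joined stream (the enrich function is hypothetical):

mainStream
  .map(toPair)
  .leftOuterJoin(state)
  .foreachRDD { rdd =>
    rdd.foreach { case (key, (event, lookupValue)) =>
      // lookupValue stays None until the key shows up in the lookup data
      enrich(event, lookupValue)  // hypothetical downstream processing
    }
  }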

While this is probably less than optimal from a performance point of view, it leverages the architecture which is already in place and frees you from manually dealing with invalidation or failure recovery.
