
Is it possible to create a variable directly in Spark workers?

What I would like to do is generate a context inside every Spark worker that I can use for local look-ups. The look-up data is located in a database and I would like to cache it on every worker. Is there a simple way to do this?

Workarounds used:

  1. Create a lazily initialized Broadcast variable and use it with my functions. The first time a function tries to access it, I call my SQL code to initialize it.
  2. Create an eagerly initialized Broadcast and use torrent broadcasting to make it available in the workers.
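A minimal sketch of the first workaround, assuming a hypothetical `loadLookupTable()` that runs the SQL query and returns the look-up map (the names here are illustrative, not from the original):

```scala
// Serializable wrapper whose @transient lazy val is rebuilt on each
// executor JVM the first time a task touches it, rather than being
// serialized from the driver.
class LazyLookup(loader: () => Map[String, String]) extends Serializable {
  @transient lazy val table: Map[String, String] = loader()
}

// Driver side: broadcast the still-uninitialized wrapper.
// val lookup = sc.broadcast(new LazyLookup(() => loadLookupTable()))

// Task side: the first access on each worker triggers the SQL load there.
// rdd.map(key => lookup.value.table.get(key))
```

Because the `lazy val` is marked `@transient`, only the loader closure travels with the broadcast; the actual map is materialized once per executor JVM, which gives the replicated-cache behavior described above.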

PS. I did not use JdbcRDD because I want the data to be replicated rather than partitioned. Does anyone know what would happen if I did not use the partitioning attributes of JdbcRDD? Would it just work, or would it behave non-deterministically?

You could create a singleton object containing a reference to the resolution cache you want to use:

object ResolutionCache {
   var connection: java.sql.Connection = _
   var cache: Map[Key, Value] = Map()
   def resolve(key: Key): Value = ???
}

Then this object can be used to resolve values in an RDD operation:

val resolved = keysRDD.map(key => (key -> ResolutionCache.resolve(key)))

The connections and values held by this object will be maintained independently per worker JVM. We must take special care of connection management and concurrent behavior. In particular, `resolve` must be thread-safe.
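One way to satisfy that thread-safety requirement is to let the Scala runtime synchronize initialization through a `lazy val`; `loadFromDatabase()` below is a hypothetical stand-in for the SQL look-up code, not part of the original answer:

```scala
object ResolutionCache {
  // Hypothetical: opens a connection, runs the SQL query once, and
  // materializes the full look-up table into memory.
  private def loadFromDatabase(): Map[Key, Value] = ???

  // `lazy val` initialization is thread-safe in Scala: the first thread
  // to touch `cache` builds it, and concurrent callers block until it
  // is ready. This happens once per worker JVM.
  private lazy val cache: Map[Key, Value] = loadFromDatabase()

  def resolve(key: Key): Value = cache(key)
}
```

This variant also drops the mutable `var` fields of the sketch above, so `resolve` itself needs no explicit locking once the cache is built.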
