
Mapping method to RDD in Spark

I'm working with Spark (in Scala). I have a Transformer class that looks like this:

class Transformer(transformerParameters: TransformerParameters) {
  // Process the parameters

  def transform(element: String): String = {
    ??? // Do stuff (placeholder so the method compiles)
  }
}

I would like to do something like this:

val originalRDD = sc.textFile("blah")
val transformer = new Transformer(parameters)
val transformedRDD = originalRDD.map(transformer.transform)

Assuming I don't want to or can't make the Transformer class serializable, and further assuming that TransformerParameters is in fact serializable, I've seen people suggest writing this instead (though I may have misunderstood):

val transformedRDD = originalRDD.map(new Transformer(parameters).transform)

I'm fine with creating a new Transformer instance on each JVM of the cluster, but it looks like this creates a new Transformer for every line, which seems unnecessary and potentially very expensive. Is this actually what it does? Is there a way to avoid making a new instance for every line?

Thanks!

You could broadcast (implicitly or explicitly) a wrapper object that has a field for the parameters, as well as a transient field holding a reference to a Transformer.

The wrapper has a method that delegates to transform on the Transformer, but first lazily initialises the Transformer: it checks whether the Transformer reference is initialised, creates one from the parameters if it is not, and then calls transform.

In the map method you then call wrapper.transform rather than transformer.transform. This avoids creating an object on each call, and it works around the serialization problem, as each task gets its own wrapper instance and hence its own Transformer, which gets reused.
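A minimal sketch of such a wrapper, under some assumptions: the bodies of TransformerParameters and Transformer below are hypothetical stand-ins for the classes in the question, and TransformerWrapper, WrapperDemo and roundTrip are illustrative names that do not appear in the original. To stay runnable without a cluster, the demo uses a plain Java-serialization round trip in place of Spark shipping the object to an executor.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for the question's parameters class.
// Case classes are Serializable by default.
case class TransformerParameters(prefix: String)

// Hypothetical stand-in for the question's Transformer; deliberately NOT Serializable.
class Transformer(transformerParameters: TransformerParameters) {
  def transform(element: String): String = transformerParameters.prefix + element
}

// Serializable wrapper: only `parameters` travels over the wire.
// `transformer` is @transient, so it is skipped during serialization, and lazy,
// so it is built on first use after deserialization and reused afterwards.
class TransformerWrapper(parameters: TransformerParameters) extends Serializable {
  @transient private lazy val transformer = new Transformer(parameters)

  def transform(element: String): String = transformer.transform(element)
}

object WrapperDemo {
  // Serialize and deserialize a value, roughly imitating what happens
  // when an object is shipped to another JVM.
  def roundTrip[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val wrapper = new TransformerWrapper(TransformerParameters("x-"))
    val shipped = roundTrip(wrapper) // would fail if Transformer were serialized
    println(Seq("a", "b").map(shipped.transform))
  }
}
```

In the Spark job from the question, this would presumably be used as originalRDD.map(wrapper.transform), with wrapper captured by the closure. One caveat worth checking for your setup: if the wrapper is sent through an explicit broadcast variable instead, the deserialized instance is typically shared by all tasks running in the same executor, so the Transformer would need to be safe for concurrent use.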
