
Apache Spark Broadcast variables are type Broadcast? Not an RDD?

Just trying to clarify something, some low-hanging fruit: this question came from watching a user in another question try to call RDD operations on a broadcast variable. That's wrong, right?

The question is: a Spark broadcast variable is not an RDD, correct? It's a collection in Scala, am I seeing that correctly?

Looking at the Scala docs: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.broadcast.Broadcast

So it has whatever type parameter it's given when it's created, the type of whatever is passed to it? Like if this were a Java ArrayList, it would be an ArrayList of Integers? So sc.broadcast([0,1,2]) would create a Broadcast[Array[Int]] in Scala notation?

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)

(I really did search around quite a bit for a clear, straightforward answer, but it must be too basic a question, yet it's so important to understand. Thanks.)

It would be nice, but not necessary, to have some info on what Python does with broadcasts. I assume it calls the underlying Scala class and it's stored as a Scala Broadcast type under the hood?

A broadcast variable is not an RDD; however, it's not necessarily a Scala collection either. Essentially, you should just think of a broadcast variable as a local variable that is local to every machine. Every worker will have a copy of whatever you've broadcast, so you don't need to worry about assigning it to specific RDD values.
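
To make the distinction concrete, here is a minimal sketch, assuming Spark is on the classpath and run in local mode. The object and method names (BroadcastIsNotAnRdd, scale, run) are made up for illustration. RDD operations are called on the RDD, and the broadcast variable is only read through .value inside the function passed to them.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object BroadcastIsNotAnRdd {
  // Multiplies every element of an RDD by a factor read from a broadcast variable.
  def scale(numbers: RDD[Int], factor: Broadcast[Int]): RDD[Int] =
    numbers.map(n => n * factor.value)

  // Builds a local SparkContext (an assumption made for this sketch) and runs the job.
  def run(): Array[Int] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("broadcast-sketch")
    val sc = new SparkContext(conf)
    try {
      val factor: Broadcast[Int] = sc.broadcast(10)        // a Broadcast[Int], not an RDD
      val numbers: RDD[Int] = sc.parallelize(Seq(1, 2, 3)) // the actual distributed dataset
      // factor.map(...) would not compile: Broadcast has no RDD operations.
      scale(numbers, factor).collect()
    } finally {
      sc.stop()
    }
  }
}
```

Calling BroadcastIsNotAnRdd.run() should return Array(10, 20, 30).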

The best time to use a broadcast variable is when you have a fairly large object that you're going to need for most values in the RDD.

An example would be:

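import org.apache.spark.rdd.RDD
import scala.collection.immutable.HashMap

// Resident is assumed to be a user-defined class; zipCodeHash is assumed to be built elsewhere
val zipCodeHash: HashMap[Int, List[Resident]] = ??? // potentially a very large hashmap
val BVZipHash = sc.broadcast(zipCodeHash)

val zipcodes: RDD[String] = sc.textFile("../zipcodes.txt")

// Look up the residents for each zip code through the broadcast value
val allUsers = zipcodes.flatMap(a => BVZipHash.value(a.toInt))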

In this situation, since the hashmap could potentially be very large, it would be extremely wasteful to ship a new copy of it with every task that runs the function.

I hope this helps!

edit: fixed some minor mistakes in my code

edit2:

To go slightly more into the nuts and bolts of what a broadcast variable actually is:

A broadcast variable is actually a variable of type Broadcast that can contain any class (anything from an Int to any object you create). It is by no means a Scala collection. All the Broadcast class actually does is offer one of two ways of efficiently transporting the data to all the workers so they can recreate the value (internally, Spark has a BitTorrent-like P2P broadcasting system, though it also allows HTTP transfer; I'm not sure when it uses which).
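
As a rough illustration that the wrapped type can be anything serializable, here is a small sketch, assuming Spark in local mode. Thresholds is a hypothetical case class invented for this example, as are the object and method names. The type parameter of Broadcast simply follows whatever is passed to sc.broadcast.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast

// Hypothetical user-defined type, used only to show that T can be any serializable class.
case class Thresholds(low: Int, high: Int)

object BroadcastAnyType {
  def run(): (Int, Array[String]) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("broadcast-types")
    val sc = new SparkContext(conf)
    try {
      // The type parameter is inferred from the argument; annotations are for clarity.
      val limit: Broadcast[Int] = sc.broadcast(2)
      val thresholds: Broadcast[Thresholds] = sc.broadcast(Thresholds(low = 2, high = 4))

      // Tasks read the broadcast value locally on whichever executor runs them.
      val labels = sc.parallelize(Seq(1, 3, 5)).map { n =>
        val t = thresholds.value
        if (n < t.low) "low" else if (n > t.high) "high" else "mid"
      }.collect()

      (limit.value, labels)
    } finally {
      sc.stop()
    }
  }
}
```

Here limit.value is a plain Int and thresholds.value is a plain Thresholds; neither is a collection or an RDD.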

For more information on what a broadcast variable is and how to use it, I'd recommend checking out this link:

http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables

I'd also highly recommend looking into this book, as it's been very helpful to me:

http://shop.oreilly.com/product/0636920028512.do
