
Spark RDD immutability confusion

I am currently preparing for a job interview as a Data Engineer, and I am stuck on one point of confusion. The details are below.

If Spark RDDs are immutable by nature, then why are we able to create Spark RDDs with var?

Your confusion has little to do with Spark's RDDs. It will help to understand the difference between a variable and an object. A more familiar example:

Suppose you have a String, which we all know is an immutable type:

var text = "abc"                 //1
var text1 = text                 //2
text = text.substring(0,2)       //3

Like Spark's RDDs, Strings are immutable. But what do you think lines 1, 2, and 3 above do? Would you say that line 3 changed text? This is where your confusion lies: text is a variable. When declared (line 1), text is a variable pointing to a String object ("abc") in memory. That "abc" String object is not modified by line 3. Instead, line 3 creates a new String object ("ab") and reuses the same variable text to point to it. To reinforce this, note that after line 2, text and text1 are two different variables pointing to the same object (the same "abc" that was created by line 1). After line 3, text1 still points to "abc", while text points to "ab".
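The point can be checked directly with reference equality. The following is only a small sketch: the object name StringRebindingDemo and the helper rebind are made up for illustration, and eq is Scala's reference-identity comparison on AnyRef.

```scala
object StringRebindingDemo {
  // Returns (text, text1, sameObjectBefore, sameObjectAfter).
  def rebind(): (String, String, Boolean, Boolean) = {
    var text = "abc"                     // line 1: text points to the String "abc"
    val text1 = text                     // line 2: text1 points to the same object
    val sameBefore = text1 eq text       // both variables reference one object

    text = text.substring(0, 2)          // line 3: a new String is built; text is re-pointed
    val sameAfter = text1 eq text        // text1 still references the original "abc"

    (text, text1, sameBefore, sameAfter)
  }

  def main(args: Array[String]): Unit = {
    val (text, text1, sameBefore, sameAfter) = rebind()
    println(text)        // ab
    println(text1)       // abc
    println(sameBefore)  // true
    println(sameAfter)   // false
  }
}
```

The original "abc" is still intact and reachable through text1; only the variable text was changed to reference something else.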

Once you see that a variable and the object it may point to are two different things, it is easy to apply this to your RDD example (it is in fact very similar to the String example above):

var a = sc.parallelize(Seq("1", "2", "3")) //like line 1 above
a = a.map(_ + " is a number")              //like line 3 above 

So, the first line creates an RDD object in memory, declares a variable a, and makes the variable a point to that RDD object. The second line computes a new RDD object (derived from the first one), but reuses the same variable.

This means that a.map(_ + " is a number") creates a new RDD object from the first one. The first one is unchanged; it is simply no longer assigned to a variable, because you reused the same variable to point to the derived RDD.
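To see that the first RDD survives untouched, keep a second reference to it before reassigning a. This is a sketch under some assumptions: it builds its own local SparkSession (in spark-shell you would use the existing sc instead), and the names RddRebindingDemo, demo, and original are hypothetical.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object RddRebindingDemo {
  // Returns (contents of the first RDD, contents of the derived RDD, sameObject).
  def demo(sc: SparkContext): (Seq[String], Seq[String], Boolean) = {
    var a = sc.parallelize(Seq("1", "2", "3")) // a points to the first RDD
    val original = a                           // second reference to that same RDD
    a = a.map(_ + " is a number")              // new RDD; a is re-pointed to it

    (original.collect().toSeq, a.collect().toSeq, original eq a)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local, in-process Spark session is acceptable for the demo.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("RddRebindingDemo")
      .getOrCreate()
    try {
      val (before, after, same) = demo(spark.sparkContext)
      println(before.mkString(", ")) // 1, 2, 3
      println(after.mkString(", "))  // 1 is a number, 2 is a number, 3 is a number
      println(same)                  // false
    } finally {
      spark.stop()
    }
  }
}
```

The map call did not alter the first RDD; it returned a different RDD object, and the assignment only changed which object a refers to.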

In short: when we say that Spark's RDDs are immutable, we mean that those objects (not the variables pointing to them) cannot be mutated; the object's state in memory cannot be modified. The non-final variables pointing to them (var in Scala) can still be reassigned, just as is the case with String objects.
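The two properties are independent, which a contrast with a mutable type makes visible. This is a minimal sketch; the object name ValVsImmutableDemo is made up, and java.lang.StringBuilder is used only as a convenient example of a mutable object.

```scala
object ValVsImmutableDemo {
  def demo(): (String, String) = {
    // val: the variable cannot be reassigned, yet the object it points to is mutable.
    val builder = new java.lang.StringBuilder("abc")
    builder.append("d")   // mutates the existing object in place
    // builder = new java.lang.StringBuilder("x") // would not compile: reassignment to val

    // var: the variable can be reassigned, yet every String object is immutable.
    var text = "abc"
    text = text + "d"     // builds a new String and re-points text to it

    (builder.toString, text)
  }

  def main(args: Array[String]): Unit = {
    println(demo())       // (abcd,abcd)
  }
}
```

Both variables end up referring to "abcd", but by different routes: the builder object was modified, whereas the String was replaced. An RDD behaves like the String here, whether it is held in a val or a var.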

This is about programming fundamentals, and I'd suggest you go through some of the analogies in this post: What is the difference between a variable, object, and reference?
