Difference between Spark toLocalIterator and iterator methods
While coding Spark programs I came across this `toLocalIterator()` method. Earlier I was using only the `iterator()` method.
If anyone has ever used this method, please shed some light on it.
I came across it while using the `foreach` and `foreachPartition` methods in a Spark program.
Can I pass the `foreach` method result to the `toLocalIterator` method, or vice versa?
toLocalIterator() -> foreachPartition()
iterator() -> foreach()
First of all, the `iterator` method of an RDD should not be called. As you can read in the [Javadocs](https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/rdd/RDD.html#iterator(org.apache.spark.Partition,%20org.apache.spark.TaskContext)): "This should *not* be called by users directly, but is available for implementors of custom subclasses of RDD."
As for `toLocalIterator`, it is used to collect the data from the RDD, scattered around your cluster, into one single node (the one from which the program is running) and do something with all the data on that node. It is similar to the `collect` method, but instead of returning a `List` it returns an `Iterator`.
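A minimal Scala sketch of the difference (assuming a `SparkContext` named `sc` is already in scope; the RDD contents are illustrative):

```scala
// Assumes an existing SparkContext `sc`.
val rdd = sc.parallelize(1 to 1000000, numSlices = 8)

// collect() materializes the ENTIRE RDD as an Array on the driver at once,
// so the driver needs enough memory for all the data.
val all: Array[Int] = rdd.collect()

// toLocalIterator returns an Iterator that pulls one partition at a time,
// so the driver only needs enough memory for the largest single partition.
val it: Iterator[Int] = rdd.toLocalIterator
it.take(5).foreach(println)
```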
`foreach` is used to apply a function to each of the elements of the RDD, while `foreachPartition` applies a function to each of the partitions. In the first approach you get one element at a time (allowing more parallelism), and in the second one you get the whole partition (useful if you need to perform an operation with all the data of a partition at once).
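A short sketch contrasting the two (again assuming a `SparkContext` named `sc`; the printed strings are illustrative):

```scala
// Assumes an existing SparkContext `sc`.
val rdd = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)

// foreach: the function runs once per ELEMENT, on the executors.
rdd.foreach(elem => println(s"element: $elem"))

// foreachPartition: the function runs once per PARTITION and receives an
// Iterator over that partition's elements. This is handy when per-partition
// setup (e.g. opening a database connection) is expensive, because the
// setup cost is paid once per partition rather than once per element.
rdd.foreachPartition { partition =>
  partition.foreach(elem => println(s"from one partition: $elem"))
}
```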
So yes, after applying a function to an RDD using `foreach` or `foreachPartition`, you can call `toLocalIterator` to get an iterator over all the contents of the RDD and process it. However, bear in mind that if your RDD is very big, you may have memory issues. If you want to transform it into an RDD again after doing the operations you need, use the `SparkContext` to parallelize it again.
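The round trip described above can be sketched as follows (assuming a `SparkContext` named `sc`; the doubling step is a placeholder for whatever local processing you need):

```scala
// Assumes an existing SparkContext `sc`.
val rdd = sc.parallelize(1 to 100)

// Pull the data to the driver one partition at a time and process it locally.
// (With a very large RDD, prefer keeping this as an Iterator rather than
// materializing it with toSeq, to avoid driver memory pressure.)
val processed: Seq[Int] = rdd.toLocalIterator.map(x => x * 2).toSeq

// Turn the local result back into an RDD for further distributed work.
val newRdd = sc.parallelize(processed)
```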