
Difference between Spark toLocalIterator and iterator methods

While coding Spark programs I came across the toLocalIterator() method. Until now I had only used the iterator() method.

If anyone has used this method, please shed some light on it.

I came across it while using the foreach and foreachPartition methods in a Spark program.

Can I pass the result of the foreach method to the toLocalIterator method, or vice versa?

toLocalIterator() -> foreachPartition()
iterator() -> foreach()

First of all, the iterator method of an RDD should not be called by user code. As the [Javadocs](https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/rdd/RDD.html#iterator(org.apache.spark.Partition,org.apache.spark.TaskContext)) state: "This should ''not'' be called by users directly, but is available for implementors of custom subclasses of RDD."

As for toLocalIterator, it is used to collect the data of the RDD, which is scattered across your cluster, onto a single node (the one running the driver program) so you can work with all the data there. It is similar to the collect method, but instead of returning a List it returns an Iterator.
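The difference can be sketched in plain Python. This is a simulation of the semantics, not actual Spark code; the `partitions` list below is a made-up stand-in for an RDD's partitions:

```python
# Pure-Python sketch of collect() vs toLocalIterator() semantics.
# "partitions" stands in for an RDD's partitions spread across a cluster.
partitions = [[0, 1, 2], [3, 4], [5, 6, 7]]

def collect(parts):
    # Like RDD.collect(): materialize every element in driver memory at once.
    return [x for part in parts for x in part]

def to_local_iterator(parts):
    # Like RDD.toLocalIterator(): a lazy iterator that needs at most
    # one partition's worth of data in driver memory at a time.
    for part in parts:
        yield from part

print(collect(partitions))                  # [0, 1, 2, 3, 4, 5, 6, 7]
print(list(to_local_iterator(partitions)))  # same elements, streamed lazily
```

Both yield the same elements; the point is that the iterator version never holds the whole dataset in driver memory at once.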

foreach is used to apply a function to each element of the RDD, while foreachPartition applies a function to each partition. With the first approach you process one element at a time (allowing more parallelism); with the second you get a whole partition at once (useful if you need to perform an operation over all of its data together).
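Again as a pure-Python sketch of the semantics (not real Spark code), using the same kind of made-up partition list:

```python
# Sketch of foreach vs foreachPartition semantics over fake partitions.
partitions = [[1, 2], [3, 4, 5]]

def foreach(parts, f):
    # Like RDD.foreach(f): f is called once per element.
    for part in parts:
        for x in part:
            f(x)

def foreach_partition(parts, f):
    # Like RDD.foreachPartition(f): f is called once per partition and
    # receives an iterator over that partition's elements.
    for part in parts:
        f(iter(part))

seen = []
foreach(partitions, seen.append)  # one call per element
sums = []
foreach_partition(partitions, lambda it: sums.append(sum(it)))  # one call per partition
print(seen)  # [1, 2, 3, 4, 5]
print(sums)  # [3, 12]
```

In real Spark, foreachPartition is often preferred when each call needs expensive setup (for example, opening a database connection), since the setup runs once per partition instead of once per element.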

So yes, after applying a function to an RDD using foreach or foreachPartition, you can call toLocalIterator to get an iterator over the whole contents of the RDD and process it. However, bear in mind that if your RDD is very big, you may run into memory issues. If you want to turn the data back into an RDD after doing the operations you need, use the SparkContext to parallelize it again.
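In real Spark that round trip would be rdd → toLocalIterator() → process locally → sc.parallelize(...). Here is a pure-Python sketch of the re-parallelize step, splitting a local collection back into partitions; the chunking rule shown is just an illustration, not necessarily Spark's exact slicing logic:

```python
def parallelize(data, num_slices):
    # Like SparkContext.parallelize(data, numSlices): split a local
    # collection into num_slices roughly equal partitions.
    data = list(data)
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

# Process data locally on the driver, then "re-distribute" it:
processed = [x * 10 for x in range(8)]
print(parallelize(processed, 3))  # [[0, 10], [20, 30, 40], [50, 60, 70]]
```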
