Apache Spark: map vs mapPartitions?
What's the difference between an RDD's map and mapPartitions method? And does flatMap behave like map or like mapPartitions? Thanks.

(edit) i.e. what is the difference (either semantically or in terms of execution) between
def map[A, B](rdd: RDD[A], fn: (A => B))
(implicit a: Manifest[A], b: Manifest[B]): RDD[B] = {
rdd.mapPartitions({ iter: Iterator[A] => for (i <- iter) yield fn(i) },
preservesPartitioning = true)
}
And:
def map[A, B](rdd: RDD[A], fn: (A => B))
(implicit a: Manifest[A], b: Manifest[B]): RDD[B] = {
rdd.map(fn)
}
Whenever you have heavyweight initialization that should be done once for many RDD elements rather than once per RDD element, and if this initialization, such as creation of objects from a third-party library, cannot be serialized (so that Spark can transmit it across the cluster to the worker nodes), use mapPartitions() instead of map(). mapPartitions() provides for the initialization to be done once per worker task/thread/partition instead of once per RDD data element. For example, see below:
val newRd = myRdd.mapPartitions(partition => {
val connection = new DbConnection /*creates a db connection per partition*/
val newPartition = partition.map(record => {
readMatchingFromDB(record, connection)
}).toList // consumes the iterator, thus calls readMatchingFromDB
connection.close() // close dbconnection here
newPartition.iterator // create a new iterator
})
Q2. Does flatMap behave like map or like mapPartitions?

Yes. Please see example 2 of flatMap below; it is self-explanatory.
Q1. What's the difference between an RDD's map and mapPartitions?

map applies the function being utilized at a per-element level, while mapPartitions applies the function at the partition level.
Example scenario: if we have 100K elements in a particular RDD partition, then we will fire off the function being used by the mapping transformation 100K times when we use map.

Conversely, if we use mapPartitions then we will only call the particular function one time, but we will pass in all 100K records and get back all responses in one function call.

There can be a performance gain since map invokes the function so many times, especially if the function is doing something expensive each time that it wouldn't need to do if we passed in all the elements at once (as in the case of mapPartitions).
map

Applies a transformation function on each item of the RDD and returns the result as a new RDD.

Listing Variants

def map[U: ClassTag](f: T => U): RDD[U]

Example:
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.map(_.length)
val c = a.zip(b)
c.collect
res0: Array[(String, Int)] = Array((dog,3), (salmon,6), (salmon,6), (rat,3), (elephant,8))
mapPartitions

This is a specialized map that is called only once for each partition. The entire content of the respective partition is available as a sequential stream of values via the input argument (Iterator[T]). The custom function must return yet another Iterator[U]. The combined result iterators are automatically converted into a new RDD. Please note that the tuples (3,4) and (6,7) are missing from the following result due to the partitioning we chose.
preservesPartitioning indicates whether the input function preserves the partitioner, which should be false unless this is a pair RDD and the input function doesn't modify the keys.

Listing Variants

def mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]
Example 1
val a = sc.parallelize(1 to 9, 3)

def myfunc[T](iter: Iterator[T]): Iterator[(T, T)] = {
  var res = List[(T, T)]()
  var pre = iter.next
  while (iter.hasNext) {
    val cur = iter.next
    res = (pre, cur) :: res  // prepend each adjacent pair
    pre = cur
  }
  res.iterator
}
a.mapPartitions(myfunc).collect
res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))
Example 2
val x = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9,10), 3)
def myfunc(iter: Iterator[Int]) : Iterator[Int] = {
var res = List[Int]()
while (iter.hasNext) {
val cur = iter.next;
res = res ::: List.fill(scala.util.Random.nextInt(10))(cur)
}
res.iterator
}
x.mapPartitions(myfunc).collect
// some of the numbers are not output at all, because the random count generated for them was zero
res8: Array[Int] = Array(1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 7, 7, 7, 9, 9, 10)
The above program can also be written using flatMap as follows.

Example 2 using flatMap
val x = sc.parallelize(1 to 10, 3)
x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect
res1: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10)
The mapPartitions transformation is faster than map since it calls your function once per partition, not once per element.
Further reading: foreach vs foreachPartitions: when to use what?
What's the difference between an RDD's map and mapPartitions method?
The method map converts each element of the source RDD into a single element of the result RDD by applying a function. mapPartitions converts each partition of the source RDD into multiple elements of the result (possibly none).
And does flatMap behave like map or like mapPartitions?
Neither. flatMap works on a single element (like map) and produces multiple elements of the result (like mapPartitions).
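The distinction can be seen on plain Scala collections, which share these combinators; a minimal sketch, no Spark required (mapDemo and flatMapDemo are hypothetical helper names):

```scala
// map: one input element -> exactly one output element
def mapDemo(xs: List[Int]): List[Int] = xs.map(_ * 10)

// flatMap: one input element -> zero or more output elements,
// flattened into the result (much like mapPartitions returning Iterator[U])
def flatMapDemo(xs: List[Int]): List[Int] = xs.flatMap(x => List.fill(x)(x))

println(mapDemo(List(1, 2, 3)))     // List(10, 20, 30)
println(flatMapDemo(List(1, 2, 3))) // List(1, 2, 2, 3, 3, 3)
```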
Map:

- It processes one row at a time, very similar to the map() method of MapReduce.
- You return from the transformation after every row.

MapPartitions:

- It processes the complete partition in one go.
- You can return from the function only once, after processing the whole partition.
- All intermediate results need to be held in memory until you process the whole partition.
- Provides you the equivalent of the setup(), map() and cleanup() functions of MapReduce.
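The setup()/map()/cleanup() shape can be sketched in plain Scala, simulating partitions with grouped lists; ExpensiveResource is a hypothetical stand-in for a non-serializable client such as a database connection:

```scala
// Hypothetical stand-in for an expensive, non-serializable resource
class ExpensiveResource {
  def lookup(x: Int): Int = x * 2  // pretend this calls an external service
  def close(): Unit = ()           // cleanup hook
}

// mapPartitions-style processing over a plain list of "partitions"
def processPartitions(partitions: List[List[Int]]): List[Int] =
  partitions.flatMap { partition =>
    val res = new ExpensiveResource      // setup(): once per partition
    val out = partition.map(res.lookup)  // map(): once per element
    res.close()                          // cleanup(): once per partition
    out
  }

println(processPartitions(List(List(1, 2, 3), List(4, 5))))  // List(2, 4, 6, 8, 10)
```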
Map vs mapPartitions: http://bytepadding.com/big-data/spark/spark-map-vs-mappartitions/

Spark map: http://bytepadding.com/big-data/spark/spark-map/

Spark mapPartitions: http://bytepadding.com/big-data/spark/spark-mappartitions/
Map:

Map transformation.
Map works on a single row at a time.
Map returns after each input row.
Map doesn't hold the output result in memory.
With map there is no way to know when the input ends, so you cannot tear down a service (e.g. close a connection) at the right point.
// map example
val dfList = (1 to 100) toList
val df = dfList.toDF()
val dfInt = df.map(x => x.getInt(0)+2)
display(dfInt)
MapPartition:

MapPartition transformation.
MapPartition works on a partition at a time.
MapPartition returns after processing all the rows in the partition.
MapPartition output is retained in memory, as it can return only after processing all the rows in a particular partition.
A MapPartition service can be shut down before returning.
// mapPartitions example
val dfList = (1 to 100).toList
val df = dfList.toDF()
val df1 = df.repartition(4).rdd.mapPartitions(itr => Iterator(itr.length))
df1.collect()
//display(df1.collect())
For more details, please refer to the Spark map vs mapPartitions transformation article.

Hope this is helpful!