
Apache Spark: map vs mapPartitions?

What's the difference between an RDD's `map` and `mapPartitions` methods? And does `flatMap` behave like `map` or like `mapPartitions`? Thanks.

(edit) i.e. what is the difference (either semantically or in terms of execution) between

  def map[A, B](rdd: RDD[A], fn: (A => B))
               (implicit a: Manifest[A], b: Manifest[B]): RDD[B] = {
    rdd.mapPartitions({ iter: Iterator[A] => for (i <- iter) yield fn(i) },
      preservesPartitioning = true)
  }

And:

  def map[A, B](rdd: RDD[A], fn: (A => B))
               (implicit a: Manifest[A], b: Manifest[B]): RDD[B] = {
    rdd.map(fn)
  }

Important TIP:

Whenever you have heavyweight initialization that should be done once for many RDD elements rather than once per RDD element, and this initialization (such as the creation of objects from a third-party library) cannot be serialized (so that Spark could transmit it across the cluster to the worker nodes), use `mapPartitions()` instead of `map()`. `mapPartitions()` allows the initialization to be done once per worker task/thread/partition instead of once per RDD data element. For example, see below.

val newRd = myRdd.mapPartitions(partition => {
  val connection = new DbConnection /*creates a db connection per partition*/

  val newPartition = partition.map(record => {
    readMatchingFromDB(record, connection)
  }).toList // consumes the iterator, thus calls readMatchingFromDB 

  connection.close() // close dbconnection here
  newPartition.iterator // create a new iterator
})

Q2. does flatMap behave like map or like mapPartitions?

Please see Example 2 for flatMap below; it is self-explanatory.

Q1. What's the difference between an RDD's map and mapPartitions?

`map` applies the supplied function at the per-element level, while `mapPartitions` applies it at the partition level.

Example scenario: if we have 100K elements in a particular RDD partition, then the function used by the mapping transformation will be fired 100K times when we use `map`.

Conversely, if we use `mapPartitions`, then we will call the particular function only one time, but we will pass in all 100K records and get back all responses in one function call.

There can be a performance gain, since `map` invokes the supplied function that many times; this matters especially when the function does something expensive on each call that it would not need to do if all the elements were passed in at once (as with `mapPartitions`).
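
As a plain-Scala analogy (no Spark required; everything below is illustrative, not the Spark API itself), the difference in invocation counts can be sketched like this:

```scala
// Plain-Scala analogy for map vs mapPartitions call counts (illustrative only).
val data = 1 to 1000

// map-style: the function is invoked once per element.
var mapCalls = 0
val viaMap = data.map { x => mapCalls += 1; x * 2 }

// mapPartitions-style: the function is invoked once per "partition"
// (modeled here as a single iterator) and receives all elements at once.
var partitionCalls = 0
val viaPartitions = Seq(data.iterator).flatMap { iter =>
  partitionCalls += 1
  iter.map(_ * 2)
}

println(s"map-style calls: $mapCalls")                 // 1000
println(s"mapPartitions-style calls: $partitionCalls") // 1
```

Any per-call setup cost (opening a connection, building a parser, etc.) is paid 1000 times in the first style and once in the second.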

map

Applies a transformation function on each item of the RDD and returns the result as a new RDD.

Listing Variants

def map[U: ClassTag](f: T => U): RDD[U]

Example:

val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.map(_.length)
val c = a.zip(b)
c.collect
res0: Array[(String, Int)] = Array((dog,3), (salmon,6), (salmon,6), (rat,3), (elephant,8))

mapPartitions

This is a specialized map that is called only once for each partition. The entire content of the respective partition is available as a sequential stream of values via the input argument (Iterator[T]). The custom function must return yet another Iterator[U]. The combined result iterators are automatically converted into a new RDD. Please note that the tuples (3,4) and (6,7) are missing from the following result due to the partitioning we chose.

`preservesPartitioning` indicates whether the input function preserves the partitioner; it should be `false` unless this is a pair RDD and the input function does not modify the keys.

Listing Variants

def mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

Example 1

val a = sc.parallelize(1 to 9, 3)
def myfunc[T](iter: Iterator[T]): Iterator[(T, T)] = {
  var res = List[(T, T)]()
  var pre = iter.next
  while (iter.hasNext) {
    val cur = iter.next
    res = (pre, cur) :: res
    pre = cur
  }
  res.iterator
}
a.mapPartitions(myfunc).collect
res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))

Example 2

val x = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3)
def myfunc(iter: Iterator[Int]): Iterator[Int] = {
  var res = List[Int]()
  while (iter.hasNext) {
    val cur = iter.next
    res = res ::: List.fill(scala.util.Random.nextInt(10))(cur)
  }
  res.iterator
}
x.mapPartitions(myfunc).collect
// Some of the numbers are not output at all, because the random count generated for them is zero.
res8: Array[Int] = Array(1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 7, 7, 7, 9, 9, 10)

The above program can also be written using flatMap as follows.

Example 2 using flatMap

val x = sc.parallelize(1 to 10, 3)
x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect

res1: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10)

Conclusion:

The `mapPartitions` transformation can be faster than `map`, since it calls your function once per partition, not once per element.

Further reading: foreach vs foreachPartitions: when to use what?

What's the difference between an RDD's map and mapPartitions method?

The method `map` converts each element of the source RDD into a single element of the result RDD by applying a function. `mapPartitions` converts each partition of the source RDD into multiple elements of the result (possibly none).

And does flatMap behave like map or like mapPartitions?

Neither; `flatMap` works on a single element (like `map`) and produces multiple elements of the result (like `mapPartitions`).
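
This hybrid behavior can be seen with plain Scala collections, which follow the same one-in/many-out semantics (the values here are illustrative):

```scala
// flatMap takes one element in (like map) but may emit zero, one,
// or many elements out (like mapPartitions' Iterator-in/Iterator-out shape).
val xs = List(1, 2, 3)

val mapped = xs.map(x => x * 10)              // exactly one output per input
val flat   = xs.flatMap(x => List.fill(x)(x)) // x copies of x: 0..n outputs per input

println(mapped) // List(10, 20, 30)
println(flat)   // List(1, 2, 2, 3, 3, 3)
```

An input that maps to an empty collection simply contributes nothing to the result, which is how `flatMap` can shrink as well as grow the data.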

Map:

  1. It processes one row at a time, very similar to the map() method of MapReduce.
  2. You return from the transformation after every row.

MapPartitions:

  1. It processes the complete partition in one go.
  2. You can return from the function only once, after processing the whole partition.
  3. All intermediate results need to be held in memory until you process the whole partition.
  4. Provides the equivalent of MapReduce's setup(), map() and cleanup() functions.
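
Whether point 3 actually bites depends on how the partition function is written: materializing the iterator (e.g., into a List) holds the whole partition's output in memory at once, while returning a lazily mapped iterator streams it. A plain-Scala sketch (names are illustrative):

```scala
// Materializes the whole partition's output before returning.
def eager(iter: Iterator[Int]): Iterator[Int] =
  iter.map(_ * 2).toList.iterator // all results held in memory here

// Streams results one at a time; nothing is computed until consumed.
def streamed(iter: Iterator[Int]): Iterator[Int] =
  iter.map(_ * 2)

println(eager(Iterator(1, 2, 3)).toList)    // List(2, 4, 6)
println(streamed(Iterator(1, 2, 3)).toList) // List(2, 4, 6)
```

The eager form is sometimes necessary (as in the DbConnection example above, where the iterator must be consumed before closing the connection), but the streamed form is what keeps `mapPartitions` memory-friendly.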

Map vs mapPartitions: http://bytepadding.com/big-data/spark/spark-map-vs-mappartitions/

Spark Map: http://bytepadding.com/big-data/spark/spark-map/

Spark mapPartitions: http://bytepadding.com/big-data/spark/spark-mappartitions/

Map:

Map transformation.

Map works on a single row at a time.

Map returns after each input row.

Map does not hold the output result in memory.

Map has no way to tell when the last row has been processed, so it cannot shut down a service (e.g., close a connection) at the end.

// map example
val dfList = (1 to 100).toList
val df = dfList.toDF()
val dfInt = df.map(x => x.getInt(0) + 2)
display(dfInt)

MapPartition:

MapPartition transformation.

MapPartition works on a partition at a time.

MapPartition returns after processing all the rows in the partition.

MapPartition output is retained in memory, since it returns only after processing all the rows in a particular partition.

A service opened inside MapPartition (e.g., a connection) can be shut down before returning.

// MapPartition example
val dfList = (1 to 100).toList
val df = dfList.toDF()
val df1 = df.repartition(4).rdd.mapPartitions(itr => Iterator(itr.length))
df1.collect()
// display(df1.collect())

For more details, please refer to the Spark map vs mapPartitions transformation article.

Hope this is helpful!
