
How to avoid using collect in Spark RDD in Scala?

I have a List and have to create a Map from it for further use. I am using an RDD, but with the use of collect() the job is failing on the cluster. Any help is appreciated.

Please help. Below is the sample code, from the List through rdd.collect. I have to use this Map data further, but how can I use it without collect?

This code creates a Map from the RDD (List) data. List format: (asdfg/1234/wert, asdf)

// Requires json4s for Serialization/NoTypeHints:
import org.json4s.NoTypeHints
import org.json4s.jackson.Serialization // or org.json4s.native.Serialization

// List data used to build the Map
val listData = methodToGetListData(ListData).toList

// Create an RDD from the List above
val rdd = sparkContext.makeRDD(listData)

implicit val formats = Serialization.formats(NoTypeHints)

val res = rdd
  .map(map => (getRPath(map._1), getAttribute(map._1), map._2))
  .groupBy(_._1)
  .map(tuple => {
    Map(
      "P_Id" -> "1234",
      "R_Time" -> "27-04-2020",
      "S_Time" -> "27-04-2020",
      "r_path" -> tuple._1,
      "S_Tag" -> "12345",
      tuple._1 -> tuple._2.map(a => (a._2, a._3)).toMap
    )
  })

res.collect()
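The helpers methodToGetListData, getRPath, and getAttribute are not shown in the question. Given the stated list format (asdfg/1234/wert, asdf), one plausible (purely hypothetical) reading is that the key is a /-separated path whose last segment is the attribute name:

// Hypothetical helper implementations, inferred only from the stated
// list format "(asdfg/1234/wert, asdf)"; the real ones are not shown
// in the question.

// Everything before the last "/" is taken as the resource path.
def getRPath(key: String): String =
  key.substring(0, key.lastIndexOf('/'))

// The last segment after the final "/" is taken as the attribute name.
def getAttribute(key: String): String =
  key.substring(key.lastIndexOf('/') + 1)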

Q: How can I use it without collect?


Answer: collect() will move all the data to the driver node. If the data is huge, never do that.
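The simplest way to avoid collect() is often to keep the result distributed: persist it from the executors, or key it and join it with whatever needs it. A minimal sketch against the question's res RDD, assuming json4s is on the classpath; the output path and otherRdd are hypothetical:

// Option 1: stay on the cluster and persist each record as JSON.
import org.json4s.NoTypeHints
import org.json4s.jackson.Serialization

res.mapPartitions { it =>
  // Define formats per partition to avoid closure-serialization issues.
  implicit val formats = Serialization.formats(NoTypeHints)
  it.map(m => Serialization.write(m)) // one JSON object per record
}.saveAsTextFile("hdfs:///tmp/rpath-maps") // hypothetical output path

// Option 2: if another RDD needs these maps, key them by r_path and
// join on the cluster instead of building a driver-side Map.
val keyed = res.map(m => (m("r_path").toString, m)) // (r_path, full record)
// otherRdd.join(keyed) ...   // 'otherRdd' is hypothetical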


I don't know exactly what the use case for preparing the map is, but it can be achieved with a built-in Spark API, namely collectionAccumulator. In detail:

collectionAccumulator[scala.collection.mutable.Map[String, String]]
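The lifecycle is: register the accumulator on the driver, add to it from executor-side code, then read .value back on the driver. A minimal self-contained sketch (a local SparkSession, just to show the mechanics) before the full example below:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("acc-demo").getOrCreate()

// 1. Register the accumulator on the driver.
val acc = spark.sparkContext
  .collectionAccumulator[scala.collection.mutable.Map[String, String]]("maps")

// 2. Add to it from executor-side code.
spark.sparkContext.parallelize(Seq(("a", "1"), ("b", "2")))
  .foreach { case (k, v) => acc.add(scala.collection.mutable.Map(k -> v)) }

// 3. Read it back on the driver (a java.util.List of the maps added).
println(acc.value)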


Let's suppose this is your sample DataFrame, and you want to make a map from it.

+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|Item_Id|Parent_Id|object_class_instance|Received_Time|CablesName|CablesStatus|CablesHInfoID|CablesIndex|object_class|ServiceTag|Scan_Time|relation_tree                  |
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|-0909  |1234     |Cables-1             |23-12-2020   |LC        |Installed   |ABCD1234     |0          |Cables      |ASDF123   |12345    |Start~>HInfo->Cables->Cables-1 |
|-09091 |1234111  |Cables-11            |23-12-2022   |LC1       |Installed1  |ABCD12341    |0          |Cables1     |ASDF1231  |123451   |Start~>HInfo->Cables->Cables-11|
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+

From this you want to make a map (a nested map, which in your example is prefixed with a key name), then...

Below is the full example; have a look and modify it accordingly.

package examples

import org.apache.log4j.Level

object GrabMapbetweenClosure extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)


  import org.apache.spark.sql.SparkSession

  val spark = SparkSession
    .builder()
    .master("local[*]")
    .appName(this.getClass.getName)
    .getOrCreate()

  import spark.implicits._

  // Accumulator that collects one mutable Map per row; written on executors, read on the driver.
  val mutableMapAcc = spark.sparkContext.collectionAccumulator[scala.collection.mutable.Map[String, String]]("mutableMap")

  val df = Seq(
    ("-0909", "1234", "Cables-1", "23-12-2020", "LC", "Installed", "ABCD1234"
      , "0", "Cables", "ASDF123", "12345", "Start~>HInfo->Cables->Cables-1")
    , ("-09091", "1234111", "Cables-11", "23-12-2022", "LC1", "Installed1", "ABCD12341"
      , "0", "Cables1", "ASDF1231", "123451", "Start~>HInfo->Cables->Cables-11")

  ).toDF("Item_Id", "Parent_Id", "object_class_instance", "Received_Time", "CablesName", "CablesStatus", "CablesHInfoID",
    "CablesIndex", "object_class", "ServiceTag", "Scan_Time", "relation_tree"
  )

  df.show(false)
  df.foreachPartition { partition => // foreachPartition used for performance's sake
    partition.foreach {
      record => {
        mutableMapAcc.add(scala.collection.mutable.Map(
          "Item_Id" -> record.getAs[String]("Item_Id")
          , "CablesStatus" -> record.getAs[String]("CablesStatus")
          , "CablesHInfoID" -> record.getAs[String]("CablesHInfoID")
          , "Parent_Id" -> record.getAs[String]("Parent_Id")
          , "CablesIndex" -> record.getAs[String]("CablesIndex")
          , "object_class_instance" -> record.getAs[String]("object_class_instance")
          , "Received_Time" -> record.getAs[String]("Received_Time")
          , "object_class" -> record.getAs[String]("object_class")
          , "CablesName" -> record.getAs[String]("CablesName")
          , "ServiceTag" -> record.getAs[String]("ServiceTag")
          , "Scan_Time" -> record.getAs[String]("Scan_Time")
          , "relation_tree" -> record.getAs[String]("relation_tree")

        )
        )
      }
    }
  }
  println("FinalMap : " + mutableMapAcc.value.toString)

}


Result:

+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|Item_Id|Parent_Id|object_class_instance|Received_Time|CablesName|CablesStatus|CablesHInfoID|CablesIndex|object_class|ServiceTag|Scan_Time|relation_tree                  |
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|-0909  |1234     |Cables-1             |23-12-2020   |LC        |Installed   |ABCD1234     |0          |Cables      |ASDF123   |12345    |Start~>HInfo->Cables->Cables-1 |
|-09091 |1234111  |Cables-11            |23-12-2022   |LC1       |Installed1  |ABCD12341    |0          |Cables1     |ASDF1231  |123451   |Start~>HInfo->Cables->Cables-11|
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+

FinalMap : [Map(Scan_Time -> 123451, ServiceTag -> ASDF1231, Received_Time -> 23-12-2022, object_class_instance -> Cables-11, CablesHInfoID -> ABCD12341, Parent_Id -> 1234111, Item_Id -> -09091, CablesIndex -> 0, object_class -> Cables1, relation_tree -> Start~>HInfo->Cables->Cables-11, CablesName -> LC1, CablesStatus -> Installed1), Map(Scan_Time -> 12345, ServiceTag -> ASDF123, Received_Time -> 23-12-2020, object_class_instance -> Cables-1, CablesHInfoID -> ABCD1234, Parent_Id -> 1234, Item_Id -> -0909, CablesIndex -> 0, object_class -> Cables, relation_tree -> Start~>HInfo->Cables->Cables-1, CablesName -> LC, CablesStatus -> Installed)]

A similar problem was solved here.
