
How can I optimize the conversion of a DataFrame to Map[String,List[String]] in Scala/Spark?

I have a DataFrame with a lot of signals and I want to convert it into a Map[String, List[String]].

I have working code, but the problem is that it takes very long to execute. For only a few hundred signals it needs about 13 minutes.

This is the inputDataFrame I got in the beginning (example):

+----------+-----+
|SignalName|Value|
+----------+-----+
|        S1|   V1|
|        S2|   V1|
|        S1|   V2|
|        S2|   V2|
|        S3|   V1|
|        S1|   V3|
|        S1|   V1|
+----------+-----+

Then I want to filter out the duplicates:

var reducedDF = inputDataFrame.select("SignalName","Value").dropDuplicates()

The output of reducedDF.show:

+----------+-----+
|SignalName|Value|
+----------+-----+
|        S1|   V1|
|        S1|   V2|
|        S1|   V3|
|        S2|   V1|
|        S2|   V2|
|        S3|   V1|
+----------+-----+

The next step is to get an RDD of SignalNames without any duplicates. I used zipWithIndex(), because later I want to look up every value of the RDD by its index. I can do this with the following code:

var RDDOfSignalNames = reducedDF.select("SignalName").rdd.map(r => r(0).asInstanceOf[String])  
RDDOfSignalNames = RDDOfSignalNames.distinct() 
val RDDwithIndex = RDDOfSignalNames.zipWithIndex() 
val indexKey = RDDwithIndex.map { case (k, v) => (v, k) }
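To illustrate (just a sketch; the exact index assignment depends on partitioning), indexKey now holds pairs of (index, SignalName):

indexKey.collect()   // e.g. Array((0,S1), (1,S2), (2,S3))
indexKey.lookup(0)   // e.g. Seq("S1"); note that every lookup launches a Spark job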

And now the last step is to get, for every SignalName, every possible Value as a List[String] and add it to a Map:

var dataTmp: DataFrame = null
var signalname = Seq[String]("")
var map = scala.collection.mutable.Map[String, List[String]]()

for (i <- 0 to (RDDOfSignalNames.count()).toInt - 1) {

  signalname = indexKey.lookup(i)  // one Spark job per iteration

  dataTmp = reducedDF.filter(reducedDF.col("SignalName").contains(signalname(0)))

  // collect the Value column for this signal: another Spark job per iteration
  map += (signalname(0) -> dataTmp.select("Value").rdd.map(r => r(0).asInstanceOf[String]).collect().toList)
  println(i + "/" + (RDDOfSignalNames.count().toInt - 1).toString())

}

In the end, the Map looks like this:

scala.collection.mutable.Map[String,List[String]] = Map(S1 -> List(V1, V2, V3), S3 -> List(V1), S2 -> List(V1, V2))

The problem is the line map += ...: for 106 signals it takes about 13 minutes! Is there a more efficient way to do this?

First of all, the use of var is not recommended in Scala. You should always try to use immutable values. So changing the following line

var reducedDF = inputDataFrame.select("SignalName","Value").dropDuplicates()

to

val reducedDF = inputDataFrame.select("SignalName","Value").distinct()

is preferred.

And you don't need to go through such complexities to get your desired output. The loop above launches several Spark jobs for every single signal (the lookup, the collect, and even the count() in the println), which is why it takes so long. You can get your desired output with a single aggregation:

import org.apache.spark.sql.functions.collect_list
import spark.implicits._  // needed for the $"..." column syntax

reducedDF
  .groupBy("SignalName")
  .agg(collect_list($"Value").as("Value"))
  .rdd
  .map(row => (row(0).toString -> row(1).asInstanceOf[scala.collection.mutable.WrappedArray[String]].toList))
  .collectAsMap()

where reducedDF.groupBy("SignalName").agg(collect_list($"Value").as("Value")) gives you the following dataframe:

+----------+------------+
|SignalName|Value       |
+----------+------------+
|S3        |[V1]        |
|S2        |[V2, V1]    |
|S1        |[V1, V2, V3]|
+----------+------------+

The rest of the code, .rdd.map(row => (row(0).toString -> row(1).asInstanceOf[scala.collection.mutable.WrappedArray[String]].toList)).collectAsMap(), just converts that dataframe to your desired output Map.
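If you prefer to stay in the typed Dataset API instead of dropping down to RDDs, an equivalent sketch (assuming the same reducedDF and a SparkSession in scope) produces the same Map:

import spark.implicits._

val signalMap: Map[String, List[String]] =
  reducedDF
    .groupBy("SignalName")
    .agg(collect_list($"Value").as("Value"))
    .as[(String, Seq[String])]  // typed view of (SignalName, collected values)
    .collect()
    .map { case (name, values) => name -> values.toList }
    .toMap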

The final map output is:

Map(S1 -> List(V1, V2, V3), S3 -> List(V1), S2 -> List(V2, V1))
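
For completeness, here is a minimal self-contained sketch that can be pasted into a spark-shell (it assumes a SparkSession named spark, as the shell provides, and rebuilds the example input from the question):

import org.apache.spark.sql.functions.collect_list
import spark.implicits._

// rebuild the example input from the question
val inputDataFrame = Seq(
  ("S1", "V1"), ("S2", "V1"), ("S1", "V2"),
  ("S2", "V2"), ("S3", "V1"), ("S1", "V3"), ("S1", "V1")
).toDF("SignalName", "Value")

val reducedDF = inputDataFrame.select("SignalName", "Value").distinct()

val result = reducedDF
  .groupBy("SignalName")
  .agg(collect_list($"Value").as("Value"))
  .rdd
  .map(row => row(0).toString -> row.getSeq[String](1).toList)
  .collectAsMap()

println(result)  // e.g. Map(S1 -> List(V1, V2, V3), S3 -> List(V1), S2 -> List(V2, V1))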
