Spark SQL Dataframes - replace function from DataFrameNaFunctions does not work if the Map is created with RDD.collectAsMap()

I am using the replace function from DataFrameNaFunctions to replace the values of a column in a DataFrame with values from a Map.

The keys and values of the Map come from a delimited file. The file is read into an RDD, transformed into a pair RDD, and converted to a Map. For example, a text file of month numbers and month names looks like this:

01,January
02,February
03,March
...   ...
...   ...

val mRDD1 = sc.textFile("file:///.../monthlist.txt")

When this data is converted to a Map using RDD.collect().toMap as shown below, DataFrame.na.replace works fine. I will refer to this as Method 1.

val monthMap1= mRDD1.map(_.split(",")).map(line => (line(0),line(1))).collect().toMap
monthMap1: scala.collection.immutable.Map[String,String] = Map(12 -> December, 08 -> August, 09 -> September, 11 -> November, 05 -> May, 04 -> April, 10 -> October, 03 -> March, 06 -> June, 02 -> February, 07 -> July, 01 -> January)

val df2 = df1.na.replace("monthname", monthMap1)
df2: org.apache.spark.sql.DataFrame = [col1: int, col2: string ... 13 more fields]

However, when the data is converted to a Map using RDD.collectAsMap() as shown below (Method 2), the call fails because the result is not an immutable Map. Is there a simple way to convert this scala.collection.Map into a scala.collection.immutable.Map so that the call does not produce this error?

val monthMap2= mRDD1.map(_.split(",")).map(line => (line(0),line(1))).collectAsMap()
monthMap2: scala.collection.Map[String,String] = Map(12 -> December, 09 -> September, 03 -> March, 06 -> June, 11 -> November, 05 -> May, 08 -> August, 02 -> February, 01 -> January, 10 -> October, 04 -> April, 07 -> July)

val df3 = df1.na.replace("monthname", monthMap2)
<console>:30: error: overloaded method value replace with alternatives:
  [T](cols: Seq[String], replacement: scala.collection.immutable.Map[T,T])org.apache.spark.sql.DataFrame <and>
  [T](col: String, replacement: scala.collection.immutable.Map[T,T])org.apache.spark.sql.DataFrame <and>
  [T](cols: Array[String], replacement: java.util.Map[T,T])org.apache.spark.sql.DataFrame <and>
  [T](col: String, replacement: java.util.Map[T,T])org.apache.spark.sql.DataFrame
 cannot be applied to (String, scala.collection.Map[String,String])
       val df3 = df1.na.replace("monthname", monthMap2)
                        ^

Method 1 works fine. For Method 2, however, I would like to know the simplest, most direct way to convert a scala.collection.Map into a scala.collection.immutable.Map, and which libraries I need to import for that.

Thanks

You can try this:

val monthMap2 = mRDD1.map(_.split(",")).map(line => (line(0),line(1))).collectAsMap()

// create an immutable map from monthMap2
val monthMap = collection.immutable.Map(monthMap2.toSeq: _*)

val df3 = df1.na.replace("monthname", monthMap)
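An even shorter alternative: calling .toMap on the result of collectAsMap also yields a scala.collection.immutable.Map directly, because toMap always builds an immutable map regardless of the source collection. This can be verified with plain Scala collections, no Spark required (the values below are just sample data):

```scala
// collectAsMap returns a scala.collection.Map; .toMap converts it to an
// immutable Map. Illustrated here with a plain mutable Map standing in for
// the collectAsMap result.
val generic: scala.collection.Map[String, String] =
  scala.collection.mutable.Map("01" -> "January", "02" -> "February")

// toMap always returns scala.collection.immutable.Map
val monthMap: scala.collection.immutable.Map[String, String] = generic.toMap
```

So appending .toMap to the collectAsMap() pipeline would also satisfy the immutable-Map overload of replace.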

The replace method also accepts a java.util.Map, so alternatively you can convert it like this:

import scala.jdk.CollectionConverters._

val df3 = df1.na.replace("monthname", monthMap2.asJava)
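One version caveat: scala.jdk.CollectionConverters was introduced in Scala 2.13; Spark builds on Scala 2.12 use scala.collection.JavaConverters instead, with the same asJava syntax. The conversion itself can be checked without Spark (sample data only):

```scala
import scala.jdk.CollectionConverters._ // Scala 2.13+; on 2.12 use scala.collection.JavaConverters._

val m: scala.collection.Map[String, String] =
  scala.collection.Map("01" -> "January")

// asJava wraps the Scala map in a java.util.Map view without copying
val jm: java.util.Map[String, String] = m.asJava
```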
