
Update columns of a Spark dataframe which are of type Map with desired key and value

I have the following Spark dataframe, where every column (except the primary key column emp_id) holds a map with the keys 'from' and 'to', either of which can have a null value. For each column (except emp_id) I want to evaluate 'from' and 'to' and add a new key named 'change' to the map, whose value is:

a) 'insert' if 'from' is null and 'to' is not null
b) 'delete' if 'to' is null and 'from' is not null
c) 'update' if 'from' and 'to' are both not null and the 'from' value differs from the 'to' value

(A minimal sketch of this rule follows the expected output below.)

Note: columns whose value is null will remain untouched.

Important note: the type of these columns is not Map[String, String] but rather something like Map[String, Any], meaning the values can also be other struct objects.

How can we achieve this in Scala?

|emp_id|emp_city             |emp_name                    |emp_phone            |emp_sal                   |emp_site                          |
|1     |null                 |[from -> Will, to -> Watson]|null                 |[from -> 1000, to -> 8000]|[from ->, to -> Seattle]          |
|3     |null                 |[from -> Norman, to -> Nate]|null                 |[from -> 1000, to -> 8000]|[from -> CherryHill, to -> Newark]|
|4     |[from ->, to -> Iowa]|[from ->, to -> Ian]        |[from ->, to -> 1004]|[from ->, to -> 8000]     |[from ->, to -> Des Moines]       |

Expected:

|emp_id|emp_city                               |emp_name                                      |emp_phone                              |emp_sal                                     |emp_site                                            |
|1     |null                                   |[from -> Will, to -> Watson, change -> update]|null                                   |[from -> 1000, to -> 8000, change -> update]|[from ->, to -> Seattle, change -> insert]          |
|3     |null                                   |[from -> Norman, to -> Nate, change -> update]|null                                   |[from -> 1000, to -> 8000, change -> update]|[from -> CherryHill, to -> Newark, change -> update]|
|4     |[from ->, to -> Iowa, change -> insert]|[from ->, to -> Ian, change -> insert]        |[from ->, to -> 1004, change -> insert]|[from ->, to -> 8000, change -> insert]     |[from ->, to -> Des Moines, change -> insert]       |
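For reference, the 'change' rule above can be expressed as a small pure function. This is only an illustrative sketch; changeFor is a hypothetical helper name, and None stands in for a null entry:

// Hypothetical helper illustrating the classification rule; None models a null entry.
def changeFor(from: Option[Any], to: Option[Any]): Option[String] = (from, to) match {
  case (None, Some(_))              => Some("insert") // 'from' is null, 'to' is not
  case (Some(_), None)              => Some("delete") // 'to' is null, 'from' is not
  case (Some(f), Some(t)) if f != t => Some("update") // both present and different
  case _                            => None           // no 'change' key is added
}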

You can do this via a Row mapper function as shown below; please find the code explanation inline.

import org.apache.spark.sql.{Row, SparkSession}

object MapUpdater {

  def main(args: Array[String]): Unit = {

    // Create the Spark session (use your own session factory here if you have one)
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("MapUpdater")
      .getOrCreate()

    import spark.implicits._

    // Load your data (the nulls are typed so that toDF can derive the schema)
    val df = List[(Int, Map[String, String], Map[String, String], Map[String, String], Map[String, String], Map[String, String])](
      (1, null, Map("from" -> "Will", "to" -> "Watson"), null, Map("from" -> "1000", "to" -> "8000"), Map("from" -> null, "to" -> "Seattle")),
      (2, null, Map("from" -> "Norman", "to" -> "Nate"), null, Map("from" -> "1000", "to" -> "8000"), Map("from" -> "CherryHill", "to" -> "Newark")),
      (3, Map("from" -> null, "to" -> "Iowa"), Map("from" -> null, "to" -> "Ian"), Map("from" -> null, "to" -> "1004"), Map("from" -> "1000", "to" -> "8000"), Map("from" -> null, "to" -> "Des Moines"))
    ).toDF("emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site")

    // Map each row: rebuild every map column with the extra "change" key
    df.map(row => {

      val new_emp_city  = mapUpdater(row, 1)
      val new_emp_name  = mapUpdater(row, 2)
      val new_emp_phone = mapUpdater(row, 3)
      val new_emp_sal   = mapUpdater(row, 4)
      val new_emp_site  = mapUpdater(row, 5)

      (row.getInt(0), new_emp_city, new_emp_name, new_emp_phone, new_emp_sal, new_emp_site)

    }).toDF("emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site")
      .show(false)

  }

  // Row mapper function: adds a "change" key (insert/delete/update) to the map at position colId
  private def mapUpdater(row: Row, colId: Int): Map[String, String] = {
    val old_map = row.getAs[Map[String, String]](colId)

    val new_map: Map[String, String] = if (null != old_map) {
      val from = old_map.getOrElse("from", null)
      val to = old_map.getOrElse("to", null)
      if (null == from && null != to) {
        old_map + ("change" -> "insert")
      } else if (null != from && null == to) {
        old_map + ("change" -> "delete")
      } else if (null != from && null != to && from != to) {
        old_map + ("change" -> "update")
      } else old_map
    } else old_map // null columns remain untouched
    new_map
  }
}
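As a side note, if you are on Spark 2.4+ and the map values are plain strings (as in the sample data above, not the generic Map[String, Any] case), the same result can be sketched with built-in column functions instead of a row mapper. This is only a sketch under those assumptions; withChange is a hypothetical helper name, and df refers to the dataframe built in main above.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// Sketch only: assumes Spark 2.4+ (map_concat) and string map values.
// For struct-valued maps (Map[String, Any]) the values would first have to
// be cast or serialized to a common type.
def withChange(colName: String): Column = {
  val m = col(colName)
  val change = when(m("from").isNull && m("to").isNotNull, "insert")
    .when(m("from").isNotNull && m("to").isNull, "delete")
    .when(m("from").isNotNull && m("to").isNotNull && m("from") =!= m("to"), "update")
  // Keep the column unchanged when it is null or no rule applies,
  // otherwise append the "change" entry to the existing map.
  when(change.isNull, m).otherwise(map_concat(m, map(lit("change"), change)))
}

// Apply to every map column except the primary key (place after df is built in main).
val result = df.columns.filter(_ != "emp_id")
  .foldLeft(df)((acc, c) => acc.withColumn(c, withChange(c)))

result.show(false)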
