
How do I pivot/transpose the rows of a column into individual columns in Spark Scala without using the pivot method?

Please see the image below for a reference to my use case:

[image: example input table and desired pivoted output]

Use the groupBy, pivot and agg functions. Check the code below; I've added inline comments.

scala> df.show(false)
+----------+------+----+
|tdate     |ttype |tamt|
+----------+------+----+
|2020-10-15|draft |5000|
|2020-10-18|cheque|7000|
+----------+------+----+
scala> df
  .groupBy($"tdate") // group the data by the tdate column
  .pivot("ttype", Seq("cheque", "draft")) // pivot on ttype; "cheque" and "draft" become the new column names
  .agg(first("tamt")) // aggregate the "tamt" column
  .show(false)

+----------+------+-----+
|tdate     |cheque|draft|
+----------+------+-----+
|2020-10-18|7000  |null |
|2020-10-15|null  |5000 |
+----------+------+-----+
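Conceptually, pivot groups rows by the key column and turns each distinct value of the pivot column into its own field. Here is a minimal plain-Scala sketch of that semantics (no Spark needed; the `rows` data mirrors the example DataFrame, and `None` stands in for `null`):

```scala
// Each row: (tdate, ttype, tamt), mirroring the example DataFrame.
val rows = Seq(
  ("2020-10-15", "draft", 5000),
  ("2020-10-18", "cheque", 7000)
)

val pivotValues = Seq("cheque", "draft") // the new column names

// Group by tdate, then look up the amount for each pivot value.
val pivoted: Map[String, Seq[Option[Int]]] =
  rows.groupBy(_._1).map { case (date, rs) =>
    val byType = rs.map(r => r._2 -> r._3).toMap
    date -> pivotValues.map(byType.get) // None plays the role of null
  }

// pivoted("2020-10-18") == Seq(Some(7000), None)
```

Spark's pivot does the same reshaping, but distributed and with `first("tamt")` choosing the value when a (tdate, ttype) pair has several rows.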

If you know all the names of the new columns in advance, you can get the same result without using pivot by adding the columns manually (note that, unlike pivot, this keeps one output row per input row, so duplicate tdate values are not collapsed):

import org.apache.spark.sql.functions.{col, when}

dataframe
  .withColumn("cheque", when(col("ttype") === "cheque", col("tamt")))
  .withColumn("draft", when(col("ttype") === "draft", col("tamt")))
  .drop("tamt", "ttype")

Because this solution does not trigger a shuffle, your processing will be faster than using pivot.

This can be generalized if you don't know the names of the columns in advance. In that case, however, you should benchmark to check whether pivot is more performant, since computing the distinct values requires an extra job:

import org.apache.spark.sql.functions.{col, when}

val newColumnNames = dataframe.select("ttype").distinct.collect().map(_.getString(0))

newColumnNames
  .foldLeft(dataframe)((df, columnName) => {
    df.withColumn(columnName, when(col("ttype") === columnName, col("tamt")))
  })
  .drop("tamt", "ttype")
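The foldLeft pattern above can be illustrated on plain Scala collections: each distinct `ttype` value adds one "column", filled only on the rows it matches, after which the original `ttype`/`tamt` keys are dropped. This is a sketch of the idea, not Spark code (rows are represented as maps purely for illustration):

```scala
// Rows as maps, standing in for DataFrame rows.
val rows = Seq(
  Map("tdate" -> "2020-10-15", "ttype" -> "draft", "tamt" -> "5000"),
  Map("tdate" -> "2020-10-18", "ttype" -> "cheque", "tamt" -> "7000")
)

// Distinct ttype values become the new column names, like the collect() above.
val newColumnNames = rows.map(_("ttype")).distinct

// Fold: add one key per ttype, populated only on matching rows (null otherwise),
// then drop the original ttype/tamt keys.
val widened = newColumnNames
  .foldLeft(rows) { (rs, columnName) =>
    rs.map(r => r + (columnName -> (if (r("ttype") == columnName) r("tamt") else null)))
  }
  .map(_ -- Seq("ttype", "tamt"))
```

As in the Spark version, the accumulator of the fold is the whole dataset, and each step widens it by one column.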

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

© 2020-2024 STACKOOM.COM