How to add multiple columns to a Spark DataFrame using Scala

I have a requirement to add 5 columns (to an existing DF), one for each of 5 months of the year.

The existing DF is like:

EId EName Esal
1   abhi  1100
2   raj   300
3   nanu  400
4   ram   500

The Output should be as follows:

EId EName Esal Jan  Feb  March April May  
1   abhi  1100 1100 1100 1100  1100  1100 
2   raj   300  300  300  300   300   300  
3   nanu  400  400  400  400   400   400
4   ram   500  500  500  500   500   500

I can do this one column at a time with withColumn, but that takes a lot of time.

Is there a way I can run some kind of loop and keep adding columns until my list of columns is exhausted?

Many thanks in advance.

You can use foldLeft. You'll need to create a List of the names of the columns that you want to add.

df.show
+---+----+----+
| id|name| sal|
+---+----+----+
|  1|   A|1100|
+---+----+----+

val list = List("Jan", "Feb" , "Mar", "Apr") // ... you get the idea

list.foldLeft(df)((acc, month) => acc.withColumn(month, $"sal")).show
+---+----+----+----+----+----+----+
| id|name| sal| Jan| Feb| Mar| Apr|
+---+----+----+----+----+----+----+
|  1|   A|1100|1100|1100|1100|1100|
+---+----+----+----+----+----+----+

Basically, you fold over the list you created, starting with the original DataFrame as the initial value and applying the withColumn transformation once per element as you traverse the list.
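To make the fold concrete, here is a small sketch in plain Scala with no Spark involved. Frame is a hypothetical stand-in for a DataFrame that only tracks column names; it is not a Spark type. The point is just to show that foldLeft produces the same result as chaining the withColumn calls by hand.

```scala
// Hypothetical stand-in for a DataFrame: just the ordered list of column names.
final case class Frame(columns: List[String]) {
  // Mimics the shape of DataFrame.withColumn: returns a new value, never mutates.
  def withColumn(name: String): Frame = Frame(columns :+ name)
}

val start  = Frame(List("id", "name", "sal"))
val months = List("Jan", "Feb", "Mar", "Apr")

// foldLeft threads the accumulator through the list, left to right.
val folded = months.foldLeft(start)((acc, month) => acc.withColumn(month))

// The same computation written out by hand.
val unrolled = start.withColumn("Jan").withColumn("Feb").withColumn("Mar").withColumn("Apr")

println(folded.columns.mkString(", ")) // id, name, sal, Jan, Feb, Mar, Apr
```

Because each step returns a new value, the starting value is left untouched, which is also how DataFrame transformations behave.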

Yes, you can do the same using foldLeft. foldLeft traverses the elements of a collection from left to right, carrying an accumulated value (here, the DataFrame) from one element to the next.

So you can store the desired columns in a List. For example:

val BazarDF = Seq(
        ("Veg", "tomato", 1.99),
        ("Veg", "potato", 0.45),
        ("Fruit", "apple", 0.99),
        ("Fruit", "pineapple", 2.59)
         ).toDF("Type", "Item", "Price")

Create a List of column names paired with their values (null values are used here as an example):

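import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// lit(null).cast(StringType) yields a real null typed as a string;
// lit("null").as("StringType") would only produce the text "null" under an alias.
val ColNameWithDatatype = List(
  ("Jan", lit(null).cast(StringType)),
  ("Feb", lit(null).cast(StringType))
)
val BazarWithColumnDF1 = ColNameWithDatatype.foldLeft(BazarDF) { (tempDF, colName) =>
  tempDF.withColumn(colName._1, colName._2)
}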

You can see the example Here

Keep in mind that the withColumn method of DataFrame can cause performance issues when called in a loop:

this method introduces a projection internally. Therefore, calling it multiple times, for instance, via loops in order to add multiple columns can generate big plans which can cause performance issues and even StackOverflowException. To avoid this, use select with the multiple columns at once.

The safer way is to do it with select:

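import org.apache.spark.sql.functions.col

// months is assumed to be the list of new column names, as in the foldLeft answer
val months = List("Jan", "Feb", "Mar", "Apr")
val monthsColumns = months.map { month: String =>
  col("sal").as(month)
}
val updatedDf = df.select(df.columns.map(col) ++ monthsColumns: _*)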
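For completeness, here is a sketch of the same select approach applied to the sample data from the question. The local SparkSession, the app name, and the variable names employees and withMonths are assumptions made for illustration; in spark-shell a session named spark already exists and the builder lines can be dropped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for experimentation only (assumption); spark-shell provides `spark` already.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("add-month-columns")
  .getOrCreate()
import spark.implicits._

// The sample data from the question.
val employees = Seq(
  (1, "abhi", 1100),
  (2, "raj", 300),
  (3, "nanu", 400),
  (4, "ram", 500)
).toDF("EId", "EName", "Esal")

val months = List("Jan", "Feb", "March", "April", "May")

// A single projection: every existing column plus one alias of Esal per month.
val withMonths = employees.select(
  employees.columns.map(col) ++ months.map(m => col("Esal").as(m)): _*
)

withMonths.show()
```

All five columns are added in one select, so the plan gets a single projection rather than one per withColumn call.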
