I have a condition where I have to add 5 columns (to an existing DF) for 5 months of a year.
The existing DF is like:
EId EName Esal
1 abhi 1100
2 raj 300
3 nanu 400
4 ram 500
The Output should be as follows:
EId EName Esal Jan Feb March April May
1 abhi 1100 1100 1100 1100 1100 1100
2 raj 300 300 300 300 300 300
3 nanu 400 400 400 400 400 400
4 ram 500 500 500 500 500 500
I can do this one by one with withColumn but that takes a lot of time.
Is there a way to run a loop and keep adding columns until my conditions are exhausted?
Many thanks in advance.
You can use foldLeft. You'll need to create a List of the columns that you want.
df.show
+---+----+----+
| id|name| sal|
+---+----+----+
| 1| A|1100|
+---+----+----+
val list = List("Jan", "Feb", "Mar", "Apr") // ... you get the idea
list.foldLeft(df)((df, month) => df.withColumn(month, $"sal")).show
+---+----+----+----+----+----+----+
| id|name| sal| Jan| Feb| Mar| Apr|
+---+----+----+----+----+----+----+
| 1| A|1100|1100|1100|1100|1100|
+---+----+----+----+----+----+----+
Basically, you fold over the list you created, starting with the original DataFrame and applying one withColumn transformation per element as you traverse the list.
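To make the fold easier to picture, here is a minimal plain-Scala sketch that does not use Spark at all. The "DataFrame" is modelled as a Vector of column names, and the local withColumn function is a hypothetical stand-in for the real method, so this only illustrates how foldLeft threads the accumulator through the list:

```scala
object FoldLeftSketch {
  // Hypothetical stand-in for DataFrame.withColumn: appends one column name.
  def withColumn(columns: Vector[String], name: String): Vector[String] =
    columns :+ name

  // Stand-in for the original DataFrame's schema.
  val base: Vector[String] = Vector("id", "name", "sal")
  val months: List[String] = List("Jan", "Feb", "Mar", "Apr")

  // foldLeft starts from `base` and applies withColumn once per month, left to right.
  val folded: Vector[String] =
    months.foldLeft(base)((cols, month) => withColumn(cols, month))

  // The same result written as an explicit chain of calls.
  val unrolled: Vector[String] =
    withColumn(withColumn(withColumn(withColumn(base, "Jan"), "Feb"), "Mar"), "Apr")

  def main(args: Array[String]): Unit = {
    println(folded.mkString(", ")) // prints "id, name, sal, Jan, Feb, Mar, Apr"
  }
}
```

The fold and the unrolled chain produce the same value, which is why foldLeft can replace a hand-written sequence of withColumn calls.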
Yes, you can do the same using foldLeft. foldLeft traverses the elements of a collection from left to right, starting from an initial value and passing the accumulated result to each step.
So you can store the desired columns in a List. For example:
val BazarDF = Seq(
("Veg", "tomato", 1.99),
("Veg", "potato", 0.45),
("Fruit", "apple", 0.99),
("Fruit", "pineapple", 2.59)
).toDF("Type", "Item", "Price")
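Create a List of column names and values (a null value cast to StringType is used here as an example):
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType
// Each pair is (column name, column expression); cast gives the null column a string type.
val ColNameWithDatatype = List(
  ("Jan", lit(null).cast(StringType)),
  ("Feb", lit(null).cast(StringType))
)
val BazarWithColumnDF1 = ColNameWithDatatype.foldLeft(BazarDF) { (tempDF, colName) =>
  tempDF.withColumn(colName._1, colName._2)
}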
Keep in mind that the withColumn method of DataFrame can have performance issues when called in a loop:
this method introduces a projection internally. Therefore, calling it multiple times, for instance, via loops in order to add multiple columns can generate big plans which can cause performance issues and even StackOverflowException. To avoid this, use select with the multiple columns at once.
The safer way is to do it with select:
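import org.apache.spark.sql.functions.col
// months is the list of new column names, e.g. the ones from the question
val months = List("Jan", "Feb", "March", "April", "May")
val monthsColumns = months.map { month: String =>
  col("sal").as(month)
}
// Keep all existing columns and append the new ones in a single projection
val updatedDf = df.select(df.columns.map(col) ++ monthsColumns: _*)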
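Applied to the data from the question, the select approach might look like the sketch below. The object name AddMonthColumns, the helper withCopies, and the local SparkSession settings are assumptions made for illustration and are not part of the original answers:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object AddMonthColumns {
  // Hypothetical helper: copies `source` into one new column per name,
  // using a single select instead of repeated withColumn calls.
  def withCopies(df: DataFrame, source: String, names: Seq[String]): DataFrame = {
    val existing = df.columns.toSeq.map(col)       // keep every current column
    val added = names.map(n => col(source).as(n))  // one aliased copy per new name
    df.select(existing ++ added: _*)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for trying the sketch out only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("add-month-columns")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, "abhi", 1100),
      (2, "raj", 300),
      (3, "nanu", 400),
      (4, "ram", 500)
    ).toDF("EId", "EName", "Esal")

    val months = Seq("Jan", "Feb", "March", "April", "May")
    withCopies(df, "Esal", months).show()

    spark.stop()
  }
}
```

Because all new columns are added in one projection, the query plan does not grow with the number of months the way a chain of withColumn calls does.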