
Create new columns in Spark dataframe based on a hundred column pairs

I am trying to create around 9-10 columns based on the values found in 100 columns (schm0, schm1...schm100); the values of these new columns would come from the corresponding columns (idsm0, idsm1...idsm100), which are part of the same dataframe.

There are additional columns as well, apart from these two sets of 100. The problem is that not all of the scheme columns (schm0, schm1...schm100) will have values in them; we have to traverse each one to find the values and create the new columns accordingly. 85+ of these columns will be empty most of the time, so we need to ignore those.

Input dataframe example:

+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+
|col1|col2|col3|schm0|idsm0|schm1|idsm1|schm2|idsm2|schm3|idsm3|
+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+
|   a|   b|   c|    0|    1|    2|    3|    4|    5| null| null|
+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+

schm and idsm can go up to 100, so these are basically 100 key-value pairs of columns.

Expected output:

+----+----+----+----------+-------+-------+
|col1|col2|col3|found_zero|found_2|found_4|
+----+----+----+----------+-------+-------+
|   a|   b|   c|         1|      3|      5|
+----+----+----+----------+-------+-------+

Note: There is no fixed value in any column; any column can hold any value. The columns we create have to be based on the values found in the scheme columns (schm0...schm100), and the values in those created columns would be the corresponding idsymbol values (idsm0...idsm100).

I am finding it difficult to formulate a plan to do this; any help would be greatly appreciated.

Edit: adding another input example:

+----+----+------+------+------+------+------+------+------+------+------+------+------+------+
|col1|col2|schm_0|idsm_0|schm_1|idsm_1|schm_2|idsm_2|schm_3|idsm_3|schm_4|idsm_4|schm_5|idsm_5|
+----+----+------+------+------+------+------+------+------+------+------+------+------+------+
|   2|   6|    b1|   id1|     i|   id2|    xs|   id3|    ch|   id4|  null|  null|  null|  null|
|   3|   5|    b2|   id5|    x2|   id6|    ch|   id7|    be|   id8|  null|  null|    db|  id15|
|   4|   7|    b1|   id9|    ch|  id10|    xs|  id11|    us|  id12|  null|  null|  null|  null|
+----+----+------+------+------+------+------+------+------+------+------+------+------+------+

For one particular record, the scheme columns (schm_0, schm_1...schm_100) can have around 9 to 10 unique values, as not all of the columns would be populated.

We need to create 9 different columns based on those 9 unique values. In short, for one row we need to iterate over each of the 100 scheme columns and collect all the values found there; based on the found values, separate columns need to be created, and the values in those created columns would be the values in the idsm columns (idsm_0, idsm_1...idsm_100).

That is, if schm_0 has the value 'cb', we need to create a new column, e.g. 'col_cb', and the value in this 'col_cb' column would be the value in the 'idsm_0' column. We need to do the same for all 100 columns (leaving out the empty ones).

Expected output:

+----+----+------+------+-----+------+------+------+------+------+------+
|col1|col2|col_b1|col_b2|col_i|col_x2|col_ch|col_xs|col_be|col_us|col_db|
+----+----+------+------+-----+------+------+------+------+------+------+
|   2|   6|   id1|  null|  id2|  null|   id4|   id3|  null|  null|  null|
|   3|   5|  null|   id5| null|   id6|   id7|  null|   id8|  null|  id15|
|   4|   7|   id9|  null| null|  null|  id10|  id11|  null|  id12|  null|
+----+----+------+------+-----+------+------+------+------+------+------+

Hope this clarifies the problem statement. Any help on this would be highly appreciated.

You can get the output you are expecting, but it would be a multi-step process.

First, you would have to create two separate dataframes out of the original one: one containing the schm columns and another containing the idsm columns. You will have to unpivot the schm columns and the idsm columns.

Then you would join the two dataframes on the unique combination of columns and filter out rows with null values. Finally, you would group by the unique columns, pivot on the schm values, and take the first value of the idsm columns.

//Sample Data
import org.apache.spark.sql.functions._
val initialdf = Seq(
  (2,6,"b1","id1","i","id2","xs","id3","ch","id4",null,null,null,null),
  (3,5,"b2","id5","x2","id6","ch","id7","be","id8",null,null,"db","id15"),
  (4,7,"b1","id9","ch","id10","xs","id11","us","id12","es","id00",null,null)
).toDF("col1","col2","schm_0","idsm_0","schm_1","idsm_1","schm_2","idsm_2","schm_3","idsm_3","schm_4","idsm_4","schm_5","idsm_5")
//creating two separate dataframes by unpivoting the schm and idsm columns with stack(); the pair index is extracted from the column name
val schmdf = initialdf.selectExpr("col1","col2", "stack(6, 'schm_0',schm_0, 'schm_1',schm_1, 'schm_2',schm_2, 'schm_3',schm_3, 'schm_4',schm_4, 'schm_5',schm_5) as (schm,schm_value)").withColumn("id", split($"schm", "_")(1))
val idsmdf = initialdf.selectExpr("col1","col2", "stack(6, 'idsm_0',idsm_0, 'idsm_1',idsm_1, 'idsm_2',idsm_2, 'idsm_3',idsm_3, 'idsm_4',idsm_4, 'idsm_5',idsm_5) as (idsm,idsm_value)").withColumn("id", split($"idsm", "_")(1))
//joining the two dataframes on the pair index, dropping empty pairs, and prefixing the scheme value to form the new column names
val df = schmdf.join(idsmdf, Seq("col1","col2","id"), "inner")
  .filter($"idsm_value".isNotNull) //these are real nulls, not the string "null"
  .select("col1","col2","schm","schm_value","idsm","idsm_value")
  .withColumn("schm_value", concat(lit("col_"), $"schm_value"))

//grouping by the unique columns and pivoting on the scheme values
df.groupBy("col1","col2").pivot("schm_value").agg(first("idsm_value")).show

You can see the output below:

+----+----+------+------+------+------+------+------+-----+------+------+------+
|col1|col2|col_b1|col_b2|col_be|col_ch|col_db|col_es|col_i|col_us|col_x2|col_xs|
+----+----+------+------+------+------+------+------+-----+------+------+------+
|   2|   6|   id1|  null|  null|   id4|  null|  null|  id2|  null|  null|   id3|
|   3|   5|  null|   id5|   id8|   id7|  id15|  null| null|  null|   id6|  null|
|   4|   7|   id9|  null|  null|  id10|  null|  id00| null|  id12|  null|  id11|
+----+----+------+------+------+------+------+------+-----+------+------+------+
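
The sample above hardcodes the six schm/idsm pairs inside stack(). For the full set of 100 pairs you would generate that expression instead of writing it out by hand. Below is a minimal sketch of one way to do that, assuming the columns follow the schm_N/idsm_N naming from the examples and are numbered 0 to n-1 (the helper name stackExpr is made up for illustration):

import org.apache.spark.sql.functions._

//number of schm/idsm pairs in the real data
val n = 100
//build the stack() SQL expression for a given column prefix
//(stackExpr is a hypothetical helper, not part of the Spark API)
def stackExpr(prefix: String): String = {
  val pairs = (0 until n).map(i => s"'${prefix}_$i', ${prefix}_$i").mkString(", ")
  s"stack($n, $pairs) as ($prefix, ${prefix}_value)"
}

val schmdf = initialdf.selectExpr("col1", "col2", stackExpr("schm")).withColumn("id", split($"schm", "_")(1))
val idsmdf = initialdf.selectExpr("col1", "col2", stackExpr("idsm")).withColumn("id", split($"idsm", "_")(1))

Also, calling pivot without an explicit value list makes Spark run an extra job to compute the distinct values first. Since that set can be large here, you could collect the values once and pass them in:

//collect the distinct pivot values once and pass them to pivot(),
//saving Spark the extra job it would otherwise run to discover them
val pivotValues = df.select("schm_value").distinct.collect.map(_.getString(0)).toSeq
df.groupBy("col1","col2").pivot("schm_value", pivotValues).agg(first("idsm_value")).show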
