Create new columns in Spark dataframe based on a hundred column pairs

I am trying to create around 9-10 columns based on the values in 100 columns (schm0, schm1...schm100); the values of these new columns would come from the columns (idsm0, idsm1...idsm100), which are part of the same dataframe.

There are additional columns as well, apart from these two sets of 100. The problem is that not all the scheme columns (schm0, schm1...schm100) will have values in them, and we have to traverse through each one to find the values and create the columns accordingly; 85+ of the columns will be empty most of the time, so we need to ignore them.

Input dataframe example:

+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+
|col1|col2|col3|schm0|idsm0|schm1|idsm1|schm2|idsm2|schm3|idsm3|
+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+
|   a|   b|   c|    0|    1|    2|    3|    4|    5| null| null|
+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+

schm and idsm can go up to 100, so it is basically 100 key-value pairs of columns.

Expected output:

+----+----+----+----------+-------+-------+
|col1|col2|col3|found_zero|found_2|found_4|
+----+----+----+----------+-------+-------+
|   a|   b|   c|         1|      3|      5|
+----+----+----+----------+-------+-------+

Note: There is no fixed value in any column; any column can have any value. The columns that we create have to be based on the values found in any of the scheme columns (schm0...schm100), and the values in the created columns would be the corresponding values of the scheme, ie idsymbol (idsm0...idsm100). In the example above, schm0=0, schm1=2 and schm2=4, so the created columns are found_zero, found_2 and found_4, holding the corresponding idsm values 1, 3 and 5; the empty schm3/idsm3 pair is ignored.

I am finding it difficult to formulate a plan to do this; any help would be greatly appreciated.

Edited - adding another input example:

+----+----+------+------+------+------+------+------+------+------+------+------+------+------+
|col1|col2|schm_0|idsm_0|schm_1|idsm_1|schm_2|idsm_2|schm_3|idsm_3|schm_4|idsm_4|schm_5|idsm_5|
+----+----+------+------+------+------+------+------+------+------+------+------+------+------+
|   2|   6|    b1|   id1|     i|   id2|    xs|   id3|    ch|   id4|  null|  null|  null|  null|
|   3|   5|    b2|   id5|    x2|   id6|    ch|   id7|    be|   id8|  null|  null|    db|  id15|
|   4|   7|    b1|   id9|    ch|  id10|    xs|  id11|    us|  id12|  null|  null|  null|  null|
+----+----+------+------+------+------+------+------+------+------+------+------+------+------+

For one particular record, the columns (schm_0, schm_1...schm_100) can have around 9 to 10 unique values, as not all the columns would be populated with values.

We need to create 9 different columns based on the 9 unique values. In short, for one row, we need to iterate over each of the 100 scheme columns and collect all the values found there; based on the found values, separate columns need to be created, and the values in those created columns would be the values in idsm (idsm_0, idsm_1...idsm_100).

ie if schm_0 has the value 'cb', we need to create a new column, eg 'col_cb', and the value in this column 'col_cb' would be the value in the 'idsm_0' column. Similarly, we need to do this for all 100 columns, leaving out the empty ones, as sketched below.
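
A minimal sketch of this per-row rule in plain Scala (the pair values are taken from the first example row; the names are illustrative):

// Collect the non-null (schm, idsm) pairs of one row into a map keyed by "col_<schm value>"
val pairs = Seq(("b1", "id1"), ("i", "id2"), ("xs", "id3"), ("ch", "id4"), (null, null), (null, null))
val rowColumns: Map[String, String] =
  pairs.collect { case (schm, idsm) if schm != null => s"col_$schm" -> idsm }.toMap
// rowColumns: Map(col_b1 -> id1, col_i -> id2, col_xs -> id3, col_ch -> id4)
// Every distinct key across all rows then becomes its own output column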

Expected output:

+----+----+------+------+-----+------+------+------+------+------+------+
|col1|col2|col_b1|col_b2|col_i|col_x2|col_ch|col_xs|col_be|col_us|col_db|
+----+----+------+------+-----+------+------+------+------+------+------+
|   2|   6|   id1|  null|  id2|  null|   id4|   id3|  null|  null|  null|
|   3|   5|  null|   id5| null|   id6|   id7|  null|   id8|  null|  id15|
|   4|   7|   id9|  null| null|  null|  id10|  id11|  null|  id12|  null|
+----+----+------+------+-----+------+------+------+------+------+------+

Hope this clears up the problem statement. Any help on this would be highly appreciated.

You can get the required output that you are expecting, but it would be a multi-step process.

First you would have to create two separate dataframes out of the original dataframe: one which contains the schm columns and another which contains the idsm columns. You will have to unpivot both the schm columns and the idsm columns.

Then you would join the two dataframes on the unique combination of columns and filter the result on null values. You would then group by the unique columns, pivot on the schm values, and take the first value of the idsm column.

//Sample data (assumes spark-shell, where spark.implicits._ is already in scope)
import org.apache.spark.sql.functions._
val initialdf = Seq(
  (2,6,"b1","id1","i","id2","xs","id3","ch","id4",null,null,null,null),
  (3,5,"b2","id5","x2","id6","ch","id7","be","id8",null,null,"db","id15"),
  (4,7,"b1","id9","ch","id10","xs","id11","us","id12","es","id00",null,null)
).toDF("col1","col2","schm_0","idsm_0","schm_1","idsm_1","schm_2","idsm_2","schm_3","idsm_3","schm_4","idsm_4","schm_5","idsm_5")
//Unpivot the schm and idsm columns into two separate dataframes with stack();
//the "id" column keeps the _<n> suffix so the pairs can be matched up again
val schmdf = initialdf.selectExpr("col1","col2", "stack(6, 'schm_0',schm_0, 'schm_1',schm_1, 'schm_2',schm_2, 'schm_3',schm_3, 'schm_4',schm_4, 'schm_5',schm_5) as (schm,schm_value)").withColumn("id", split($"schm", "_")(1))
val idsmdf = initialdf.selectExpr("col1","col2", "stack(6, 'idsm_0',idsm_0, 'idsm_1',idsm_1, 'idsm_2',idsm_2, 'idsm_3',idsm_3, 'idsm_4',idsm_4, 'idsm_5',idsm_5) as (idsm,idsm_value)").withColumn("id", split($"idsm", "_")(1))
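//At this point schmdf holds one unpivoted row per original row per pair, eg for (col1,col2)=(2,6):
//  +----+----+------+----------+---+
//  |col1|col2|  schm|schm_value| id|
//  +----+----+------+----------+---+
//  |   2|   6|schm_0|        b1|  0|
//  |   2|   6|schm_1|         i|  1|
//  |   2|   6|schm_2|        xs|  2|
//  ...
//idsmdf holds the matching idsm_value rows keyed by the same (col1,col2,id)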
//Join on (col1,col2,id) so each schm_<n> lines up with its idsm_<n>, drop the empty pairs,
//and prefix schm_value with "col_" for use as the pivoted column name
val df = schmdf.join(idsmdf, Seq("col1","col2","id"), "inner").filter($"idsm_value".isNotNull)
  .select("col1","col2","schm","schm_value","idsm","idsm_value").withColumn("schm_value", concat(lit("col_"), $"schm_value"))

df.groupBy("col1","col2").pivot("schm_value").agg(first("idsm_value")).show

You can see the output below:

+----+----+------+------+------+------+------+------+-----+------+------+------+
|col1|col2|col_b1|col_b2|col_be|col_ch|col_db|col_es|col_i|col_us|col_x2|col_xs|
+----+----+------+------+------+------+------+------+-----+------+------+------+
|   2|   6|   id1|  null|  null|   id4|  null|  null|  id2|  null|  null|   id3|
|   3|   5|  null|   id5|   id8|   id7|  id15|  null| null|  null|   id6|  null|
|   4|   7|   id9|  null|  null|  id10|  null|  id00| null|  id12|  null|  id11|
+----+----+------+------+------+------+------+------+-----+------+------+------+
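
Since the real data has up to 100 schm/idsm pairs, hand-writing the stack() arguments does not scale well. Below is a minimal sketch of building the same unpivot programmatically, assuming the columns follow the schm_<n>/idsm_<n> naming used above; stackExpr is a small helper defined here, not a Spark API:

//Build the stack() expression for every schm_<n>/idsm_<n> pair present in the
//dataframe instead of hand-writing 100 entries
val schmCols = initialdf.columns.filter(_.startsWith("schm_"))
val idsmCols = initialdf.columns.filter(_.startsWith("idsm_"))

def stackExpr(cols: Array[String], alias: String): String = {
  val entries = cols.map(c => s"'$c',$c").mkString(", ")
  s"stack(${cols.length}, $entries) as ($alias,${alias}_value)"
}

val schmdfAll = initialdf.selectExpr("col1", "col2", stackExpr(schmCols, "schm")).withColumn("id", split($"schm", "_")(1))
val idsmdfAll = initialdf.selectExpr("col1", "col2", stackExpr(idsmCols, "idsm")).withColumn("id", split($"idsm", "_")(1))
//From here the join, filter, groupBy and pivot steps are identical to the ones above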
