Spark: retain all the columns of the original data frame after pivot
I have a data frame with many columns, almost 50-plus (as shown below):
+----+----+---+----+----+---+----+---+----+----+---+...
|c1  |c2  |c3 |c4  |c5  |c6 |c7  |c8 |type|clm |val|...
+----+----+---+----+----+---+----+---+----+----+---+...
| 11 | 5.0|3.0| 3.0| 3.0|4.0| 3.0|3.0| t1 | a  |5  |...
| 31 | 5.0|3.0| 3.0| 3.0|4.0| 3.0|3.0| t2 | b  |6  |...
| 11 | 5.0|3.0| 3.0| 3.0|4.0| 3.0|3.0| t1 | a  |9  |...
+----+----+---+----+----+---+----+---+----+----+---+...
I want to convert one column's values into many columns, so I am thinking of using the code below:
df.groupBy("type").pivot("clm").agg(first("val")).show()
This converts the row values into columns, but the other columns (c1 to c8) do not appear in the resulting data frame.
So is it okay to use the method below to get all the columns after the pivot?

df.groupBy("c1","c2","c3","c4","c5","c6","c7","c8","type").pivot("clm").agg(first("val")).show()
pivot is not an aggregate function: in Spark it is a method on the grouped data (RelationalGroupedDataset), so it cannot be passed inside agg. You can still avoid grouping on c1..c8 by pivoting on type alone and joining the remaining columns back, aggregated with first:

df
  .groupBy("type")
  .pivot("clm")
  .agg(first("val"))
  .join(
    df.groupBy("type")
      .agg(
        first("c1").alias("c1"),
        first("c2").alias("c2"),
        first("c3").alias("c3"),
        first("c4").alias("c4"),
        first("c5").alias("c5"),
        first("c6").alias("c6"),
        first("c7").alias("c7"),
        first("c8").alias("c8")
      ),
    "type"
  )
  .show()
Writing it like that assumes that you have duplicated values for c1..c8 within the same type. If not, then the .groupBy(...) needs to be tuned for exactly how your data is organized.