
Spark: retain all the columns of the original data frame after pivot

I have a data frame which has many columns, 50-plus (as shown below):

+---+---+---+---+---+---+---+---+----+---+---+...
| c1| c2| c3| c4| c5| c6| c7| c8|type|clm|val|...
+---+---+---+---+---+---+---+---+----+---+---+...
| 11|5.0|3.0|3.0|3.0|4.0|3.0|3.0|  t1|  a|  5|...
| 31|5.0|3.0|3.0|3.0|4.0|3.0|3.0|  t2|  b|  6|...
| 11|5.0|3.0|3.0|3.0|4.0|3.0|3.0|  t1|  a|  9|...
+---+---+---+---+---+---+---+---+----+---+---+...

I want to convert one of the columns' values into many columns, so I am thinking of using the code below:

df.groupBy("type").pivot("clm").agg(first("val")).show() 

This converts the row values into columns, but the other columns (c1 to c8) do not appear in the resulting data frame.
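
For illustration, here is a minimal, self-contained sketch of that behaviour, assuming a local SparkSession and using only c1 as a stand-in for c1..c8 (the values are taken from the sample rows above; the session setup is hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder().appName("pivot-example").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  (11, "t1", "a", 5),
  (31, "t2", "b", 6),
  (11, "t1", "a", 9)
).toDF("c1", "type", "clm", "val")

// Only the grouping key ("type") and the pivoted values of "clm" survive;
// "c1" is dropped because it is neither a grouping key nor an aggregate.
// first("val") is non-deterministic for type t1 (it may return 5 or 9).
df.groupBy("type").pivot("clm").agg(first("val")).show()
// result (roughly):
// +----+----+----+
// |type|   a|   b|
// +----+----+----+
// |  t1|   5|null|
// |  t2|null|   6|
// +----+----+----+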

So is it okay to use the method below to get all the columns after the pivot?

df.groupBy("c1","c2","c3","c4","c5","c6","c7","c8","type").pivot("clm").agg(first("val")).show()

pivot has to be chained onto the grouped data frame, between groupBy and agg, and the columns you want to keep can be carried through as additional aggregates, just like any other:

df
  .groupBy("type")
  .pivot("clm")
  .agg(
    first("val"),
    first("c1"),
    first("c2"),
    first("c3"),
    first("c4"),
    first("c5"),
    first("c6"),
    first("c7"),
    first("c8")
  ).show()
// note: with several aggregates, each pivoted value of clm gets its own
// copy of the columns, named e.g. a_first(val), a_first(c1), ...

Writing it like that assumes that the values of c1..c8 are duplicated (i.e. constant) within the same type. If not, then the .groupBy(...) needs to be tuned to exactly how your data is organized.
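
For completeness, a minimal self-contained sketch of the "tuned groupBy" variant from the question, again assuming a local SparkSession and with c1 standing in for c1..c8 (the setup is hypothetical, not from the original post):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder().appName("pivot-keep-columns").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  (11, "t1", "a", 5),
  (31, "t2", "b", 6),
  (11, "t1", "a", 9)
).toDF("c1", "type", "clm", "val")

// Grouping by every column that must be retained keeps c1 in the output;
// this only collapses rows cleanly if c1..c8 are constant within a type.
df.groupBy("c1", "type").pivot("clm").agg(first("val")).show()
// result (roughly):
// +---+----+----+----+
// | c1|type|   a|   b|
// +---+----+----+----+
// | 11|  t1|   5|null|
// | 31|  t2|null|   6|
// +---+----+----+----+

If the extra first(...) aggregates in the answer produce more pivoted columns than you want, another option is to pivot only val and join the pivoted frame back to the per-type first(c1)..first(c8) aggregates on type.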
