
Spark dataframe pivot without aggregation

I am new to Spark, and I am currently trying to pivot from rows to columns without aggregation; that is, I need the duplicate rows to be kept after the pivot.

I have data like the below:

+----+----------+--------+
|abc |col       |position|
+----+----------+--------+
|1234|183500000X|0       |
|1234|0202211120|1       |
|1234|VA        |2       |
|1234|Y         |3       |
|1234|183500000X|0       |
|1234|21174     |1       |
|1234|NC        |2       |
|1234|N         |3       |
|1234|null      |0       |
|1234|null      |1       |
|1234|          |2       |
|1234|          |3       |
|1234|null      |0       |
|1234|null      |1       |
|1234|          |2       |
|1234|          |3       |
+----+----------+--------+

I would like to change it to the below format:

+----------+----------+----------+---+---+
|   abc    |         0|         1|  2|  3|
+----------+----------+----------+---+---+
|1234      |183500000X|0202211120| VA|  Y|
|1234      |183500000X|21174     | NC|  N|
+----------+----------+----------+---+---+

Whenever I try df.groupBy($"abc").pivot("position").agg(first($"col")) ... I get only one record instead of all of them.

Is there a way to get all the records without aggregation?

Do I need to join with another dataframe to pull out the data? Kindly advise.

import org.apache.spark.sql.functions._
import spark.implicits._

// Sample data with an explicit "grouping" column that records which
// set of rows belongs together.
val df = spark.sparkContext.parallelize(Seq(
        ("A", "1", "0", 1),
        ("A", "2", "1", 1),
        ("A", "VA", "2", 1),
        ("A", "7", "3", 1),
        ("A", "11", "0", 2),
        ("A", "22", "1", 2),
        ("A", "VAVA", "2", 2),
        ("A", "77", "3", 2),
        ("B", "1", "0", 3),
        ("B", null, "1", 3)
      )).toDF("abc", "col", "position", "grouping")

// Pivot per (abc, grouping) pair so each set keeps its own output row,
// then drop the helper column.
val result = df.groupBy("abc", "grouping")
               .pivot("position")
               .agg(expr("first(col)"))
               .drop("grouping")

result.show()

+---+---+----+----+----+
|abc|  0|   1|   2|   3|
+---+---+----+----+----+
|  A|  1|   2|  VA|   7|
|  A| 11|  22|VAVA|  77|
|  B|  1|null|null|null|
+---+---+----+----+----+

Leaving the nulls aspect out, you need to proceed as above by adding a grouping number in some way to your data. That is the clue. It is a data-wrangling exercise, and I do not know your data well enough to proffer advice.

Supporting Notes

The issue is that the grouping looks to be sequential, and we do not know whether the rows always come in sets of 4 or of some other size N. How do we apply such a grouping? You normally need a proper set of grouping keys, but we do not appear to have them here. Spark is less good at this type of thing, and we need to preserve row order; even with zipWithIndex this is a hard task.

The issue is, in fact, more this than the .pivot itself.
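As a minimal sketch of one way to derive such a grouping number, assuming the DataFrame still reflects the original input order and that every set of rows starts at position 0 (neither is guaranteed by Spark in general), a running count of the position-0 rows can serve as the group number:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Assumptions: the rows are still in their original input order, and each
// logical set begins at position 0 (as in the sample data above). Spark
// does not guarantee row order after shuffles, and an unpartitioned window
// pulls all rows into a single partition, so treat this as a sketch only.
val byInputOrder = Window.orderBy(monotonically_increasing_id())

// Running count of position-0 rows = synthetic group number.
val withGrouping = df.withColumn(
  "grouping",
  sum(when($"position" === "0", 1).otherwise(0)).over(byInputOrder))

val pivoted = withGrouping.groupBy("abc", "grouping")
                          .pivot("position")
                          .agg(expr("first(col)"))
                          .drop("grouping")

On the sample df above this should reproduce the three pivoted rows shown earlier. zipWithIndex on the underlying RDD is the other common route to a positional key, but, as noted, it is just as fragile with respect to ordering.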
