
Spark 2.2 dataframe [scala]

OrderNo  Status1    Status2      Status3
123      Completed  Pending      Pending
456      Rejected   Completed    Completed
789      Pending    In Progress  Completed

Above is the input data set, and the expected output is below. The catch is that we should count by distinct order number, not by the number of status occurrences. Can we do this with Spark DataFrames using Scala? Appreciate your help in advance.

Pending     2
Rejected    1
Completed   3
In Progress 2

You can try the following code. It counts the number of distinct OrderNo values for each status. I hope it helps.

// Assumes a SparkSession named `spark` is already in scope
import spark.implicits._
import org.apache.spark.sql.functions._

val rawDF = Seq(
  ("123", "Completed", "Pending", "Pending"),
  ("456", "Rejected", "Completed", "Completed"),
  ("789", "Pending", "In Progress", "Completed")
).toDF("OrderNo", "Status1", "Status2", "Status3")

// Gather the three status columns into one array, explode it into rows,
// then count distinct orders per status value
val newDF = rawDF.withColumn("All_Status", array($"Status1", $"Status2", $"Status3"))
    .withColumn("Status", explode($"All_Status"))
    .groupBy("Status").agg(size(collect_set($"OrderNo")).as("DistOrderCnt"))

Here are the results. (Note: "In Progress" only appears once in the test data.)

+-----------+------------+
|     Status|DistOrderCnt|
+-----------+------------+
|  Completed|           3|
|In Progress|           1|
|    Pending|           2|
|   Rejected|           1|
+-----------+------------+
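As a variant, the `size(collect_set(...))` aggregation can be replaced with Spark's built-in `countDistinct`, which expresses the intent more directly. This is a sketch under the same assumptions as above (a `SparkSession` named `spark` in scope and the same `rawDF`); the results should be identical.

```scala
import spark.implicits._
import org.apache.spark.sql.functions._

// Explode the status columns into rows, then count distinct orders per status
val altDF = rawDF
  .select($"OrderNo", explode(array($"Status1", $"Status2", $"Status3")).as("Status"))
  .groupBy("Status")
  .agg(countDistinct($"OrderNo").as("DistOrderCnt"))
```

`countDistinct` avoids materializing the intermediate set per group the way `collect_set` does, which can matter when each status maps to many orders.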
