
Spark 2.2 dataframe [scala]

OrderNo  Status1    Status2      Status3
123      Completed  Pending      Pending
456      Rejected   Completed    Completed
789      Pending    In Progress  Completed

Above is the input data set, and the expected output is below. The catch is that we should count by distinct order number, not by the number of status occurrences. Can we do this with Spark DataFrames using Scala? Appreciate your help in advance.

Pending     2
Rejected    1
Completed   3
In Progress 2

You can try the following code. It counts the number of distinct OrderNo values for each status. I hope it helps.

// Assumes a SparkSession named `spark` is already in scope
import spark.implicits._
import org.apache.spark.sql.functions._

val rawDF = Seq(
  ("123", "Completed", "Pending", "Pending"),
  ("456", "Rejected", "Completed", "Completed"),
  ("789", "Pending", "In Progress", "Completed")
).toDF("OrderNo", "Status1", "Status2", "Status3")

// Gather the three status columns into one array, explode it into rows,
// then count distinct orders per status value
val newDF = rawDF.withColumn("All_Status", array($"Status1", $"Status2", $"Status3"))
    .withColumn("Status", explode($"All_Status"))
    .groupBy("Status").agg(size(collect_set($"OrderNo")).as("DistOrderCnt"))

Here are the results. (Note: "In Progress" only appears once in the test data.)

+-----------+------------+
|     Status|DistOrderCnt|
+-----------+------------+
|  Completed|           3|
|In Progress|           1|
|    Pending|           2|
|   Rejected|           1|
+-----------+------------+
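As a variant, the `size(collect_set(...))` aggregation can be replaced with Spark's built-in `countDistinct`, which expresses the intent more directly. This is a sketch under the same assumptions as above (a `SparkSession` named `spark` in scope and the same `rawDF`); the results should be identical.

```scala
import spark.implicits._
import org.apache.spark.sql.functions._

// Explode the status columns into rows, then count distinct orders per status
val altDF = rawDF
  .select($"OrderNo", explode(array($"Status1", $"Status2", $"Status3")).as("Status"))
  .groupBy("Status")
  .agg(countDistinct($"OrderNo").as("DistOrderCnt"))
```

`countDistinct` avoids materializing the intermediate set per group the way `collect_set` does, which can matter when each status maps to many orders.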
