Spark: How do I find the passengers who have been on more than 3 flights together?
I have a dataset as follows:
passengerId, flightId, from, to, date
56, 0, cg, ir, 2017-01-01
78, 0, cg, ir, 2017-01-01
12, 0, cg, ir, 2017-02-01
34, 0, cg, ir, 2017-02-01
51, 0, cg, ir, 2017-02-01
56, 1, ir, uk, 2017-01-02
78, 1, ir, uk, 2017-01-02
12, 1, ir, uk, 2017-02-02
34, 1, ir, uk, 2017-02-02
51, 1, ir, uk, 2017-02-02
56, 2, uk, in, 2017-01-05
78, 2, uk, in, 2017-01-05
12, 2, uk, in, 2017-02-05
34, 2, uk, in, 2017-02-05
51, 3, uk, in, 2017-02-05
I need to produce a report in the following format:
Passenger 1 ID    Passenger 2 ID    No_flights_together
56                78                6
12                34                8
…                 …                 …
I also need to find the passengers who have been on more than N flights together within a date range:
Passenger 1 ID    Passenger 2 ID    No_Flights_Together    From          To
56                78                6                      2017-01-01    2017-03-01
12                34                8                      2017-04-05    2017-12-01
…                 …                 …                      …             …
I don't know how to go about it. Any help would be appreciated.
You can self-join the DataFrame on df1.passengerId < df2.passengerId together with matching flightId and date, then use groupBy/agg to compute the necessary count(*), min(date) and max(date):
import org.apache.spark.sql.functions.{count, max, min}
import spark.implicits._  // toDF and the $"..." syntax; `spark` is the active SparkSession

val df = Seq(
  (56, 0, "2017-01-01"),
  (78, 0, "2017-01-01"),
  (12, 0, "2017-02-01"),
  (34, 0, "2017-02-01"),
  (51, 0, "2017-02-01"),
  (56, 1, "2017-01-02"),
  (78, 1, "2017-01-02"),
  (12, 1, "2017-02-02"),
  (34, 1, "2017-02-02"),
  (51, 1, "2017-02-02"),
  (56, 2, "2017-01-05"),
  (78, 2, "2017-01-05"),
  (12, 2, "2017-02-01"),
  (34, 2, "2017-02-01"),
  (51, 3, "2017-02-01")
).toDF("passengerId", "flightId", "date")

// Self-join: the "<" condition keeps each pair once and drops self-pairs.
df.as("df1").join(df.as("df2"),
    $"df1.passengerId" < $"df2.passengerId" &&
    $"df1.flightId" === $"df2.flightId" &&
    $"df1.date" === $"df2.date",
    "inner"
  ).
  groupBy($"df1.passengerId", $"df2.passengerId").
  agg(count("*").as("flightsTogether"), min($"df1.date").as("from"), max($"df1.date").as("to")).
  where($"flightsTogether" >= 3).  // at least 3 shared flights; use > 3 for strictly more than 3
  show
// +-----------+-----------+---------------+----------+----------+
// |passengerId|passengerId|flightsTogether|      from|        to|
// +-----------+-----------+---------------+----------+----------+
// |         12|         34|              3|2017-02-01|2017-02-02|
// |         56|         78|              3|2017-01-01|2017-01-05|
// +-----------+-----------+---------------+----------+----------+
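The second report in the question asks for the same count restricted to a date range and a configurable N. A minimal sketch of one way to do that is shown here; it is meant to be pasted into spark-shell. The helper name flownTogether, its parameters, and the output column names passenger1Id and passenger2Id are assumptions made for illustration and are not part of the original answer. It assumes dates are yyyy-MM-dd strings, which compare correctly as plain strings, and it keeps the answer's ">=" threshold.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, max, min}

// Hypothetical helper: pairs of passengers who shared at least `atLeastNTimes`
// flights between `from` and `to` (both inclusive, yyyy-MM-dd strings).
def flownTogether(df: DataFrame, atLeastNTimes: Int, from: String, to: String): DataFrame = {
  // Restrict to the date range before the self-join so fewer rows are paired.
  val inRange = df.where(col("date").between(from, to))
  inRange.as("df1")
    .join(inRange.as("df2"),
      col("df1.passengerId") < col("df2.passengerId") &&
        col("df1.flightId") === col("df2.flightId") &&
        col("df1.date") === col("df2.date"))
    .groupBy(
      col("df1.passengerId").as("passenger1Id"),
      col("df2.passengerId").as("passenger2Id"))
    .agg(
      count("*").as("flightsTogether"),
      min(col("df1.date")).as("from"),
      max(col("df1.date")).as("to"))
    // Use > instead of >= for "strictly more than N".
    .where(col("flightsTogether") >= atLeastNTimes)
}

// Usage on a small subset of the sample data (local session assumed).
val spark = SparkSession.builder().master("local[*]").appName("flown-together").getOrCreate()
import spark.implicits._

val flights = Seq(
  (56, 0, "2017-01-01"), (78, 0, "2017-01-01"),
  (56, 1, "2017-01-02"), (78, 1, "2017-01-02"),
  (56, 2, "2017-01-05"), (78, 2, "2017-01-05")
).toDF("passengerId", "flightId", "date")

flownTogether(flights, 3, "2017-01-01", "2017-03-01").show()
```

Aliasing the two grouping columns avoids the duplicate passengerId column names seen in the output above, which makes the result easier to select from or write out.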