I have a dataset like the following:
passengerId, flightId, from, to, date
56, 0, cg, ir, 2017-01-01
78, 0, cg, ir, 2017-01-01
12, 0, cg, ir, 2017-02-01
34, 0, cg, ir, 2017-02-01
51, 0, cg, ir, 2017-02-01
56, 1, ir, uk, 2017-01-02
78, 1, ir, uk, 2017-01-02
12, 1, ir, uk, 2017-02-02
34, 1, ir, uk, 2017-02-02
51, 1, ir, uk, 2017-02-02
56, 2, uk, in, 2017-01-05
78, 2, uk, in, 2017-01-05
12, 2, uk, in, 2017-02-05
34, 2, uk, in, 2017-02-05
51, 3, uk, in, 2017-02-05
I need to present a report in the following formats.
Passenger 1 ID, Passenger 2 ID, No_flights_together
56, 78, 6
12, 34, 8
…, …, …
Find the passengers who have been on more than N flights together within a given date range (From, To):
Passenger 1 ID, Passenger 2 ID, No_Flights_Together, From, To
56, 78, 6, 2017-01-01, 2017-03-01
12, 34, 8, 2017-04-05, 2017-12-01
…, …, …, …, …
I'm not sure how to go about it. Help would be appreciated.
You can self-join on df1.passengerId < df2.passengerId, along with the same flightId and date, and then compute the necessary count(*), min(date) and max(date) using groupBy/agg:
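// spark-shell already has these in scope; a standalone application needs them.
import org.apache.spark.sql.functions.{count, max, min}
import spark.implicits._  // assumes a SparkSession named `spark`

// Sample data: only the columns needed for the report (from/to airports omitted).
val df = Seq(
  (56, 0, "2017-01-01"),
  (78, 0, "2017-01-01"),
  (12, 0, "2017-02-01"),
  (34, 0, "2017-02-01"),
  (51, 0, "2017-02-01"),
  (56, 1, "2017-01-02"),
  (78, 1, "2017-01-02"),
  (12, 1, "2017-02-02"),
  (34, 1, "2017-02-02"),
  (51, 1, "2017-02-02"),
  (56, 2, "2017-01-05"),
  (78, 2, "2017-01-05"),
  (12, 2, "2017-02-01"),
  (34, 2, "2017-02-01"),
  (51, 3, "2017-02-01")
).toDF("passengerId", "flightId", "date")

// Self-join: `<` on passengerId keeps each pair once and drops self-pairs.
df.as("df1").join(df.as("df2"),
  $"df1.passengerId" < $"df2.passengerId" &&
  $"df1.flightId" === $"df2.flightId" &&
  $"df1.date" === $"df2.date",
  "inner"
).
  groupBy($"df1.passengerId", $"df2.passengerId").
  // One joined row per shared flight, so count(*) is the number of flights together.
  agg(count("*").as("flightsTogether"), min($"df1.date").as("from"), max($"df1.date").as("to")).
  where($"flightsTogether" >= 3).
  show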
// +-----------+-----------+---------------+----------+----------+
// |passengerId|passengerId|flightsTogether|      from|        to|
// +-----------+-----------+---------------+----------+----------+
// |         12|         34|              3|2017-02-01|2017-02-02|
// |         56|         78|              3|2017-01-01|2017-01-05|
// +-----------+-----------+---------------+----------+----------+
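The query above reports the first and last shared flight date for each pair, but it does not restrict the flights to a requested date range, which the second report in the question asks for. A minimal sketch of one way to do that is to filter on the date before the self-join. The object name PassengerReports, the method flownTogether, its parameter names and the output column names passenger1Id/passenger2Id are all hypothetical, and the sketch assumes dates are ISO "yyyy-MM-dd" values so that they compare correctly:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, max, min}

// Hypothetical helper; names are illustrative and not from the original answer.
object PassengerReports {

  // Pairs of passengers who shared at least `atLeastNTimes` flights between
  // `from` and `to` (both inclusive, ISO "yyyy-MM-dd" strings).
  // Expects columns: passengerId, flightId, date.
  def flownTogether(df: DataFrame, atLeastNTimes: Int, from: String, to: String): DataFrame = {
    // Filter first so the self-join only sees flights inside the range.
    val inRange = df.filter(col("date").between(from, to))

    inRange.as("df1")
      .join(
        inRange.as("df2"),
        col("df1.passengerId") < col("df2.passengerId") &&
          col("df1.flightId") === col("df2.flightId") &&
          col("df1.date") === col("df2.date"))
      // Aliases avoid two output columns that are both called passengerId.
      .groupBy(
        col("df1.passengerId").as("passenger1Id"),
        col("df2.passengerId").as("passenger2Id"))
      .agg(
        count("*").as("flightsTogether"),
        min(col("df1.date")).as("from"),
        max(col("df1.date")).as("to"))
      // Use `>` instead of `>=` for a strict "more than N" threshold.
      .where(col("flightsTogether") >= atLeastNTimes)
  }
}
```

With the df defined above, PassengerReports.flownTogether(df, 3, "2017-01-01", "2017-01-31") should keep only the pair 56/78 (3 flights, 2017-01-01 to 2017-01-05), because passengers 12, 34 and 51 only fly in February in that sample.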