
Spark: How do I find the passengers who have been on more than 3 flights together?

I have a dataset like the following:

passengerId, flightId, from, to, date
56,          0,        cg,   ir, 2017-01-01
78,          0,        cg,   ir, 2017-01-01
12,          0,        cg,   ir, 2017-02-01
34,          0,        cg,   ir, 2017-02-01
51,          0,        cg,   ir, 2017-02-01

56,          1,        ir,   uk, 2017-01-02
78,          1,        ir,   uk, 2017-01-02
12,          1,        ir,   uk, 2017-02-02
34,          1,        ir,   uk, 2017-02-02
51,          1,        ir,   uk, 2017-02-02

56,          2,        uk,   in, 2017-01-05
78,          2,        uk,   in, 2017-01-05
12,          2,        uk,   in, 2017-02-05
34,          2,        uk,   in, 2017-02-05
51,          3,        uk,   in, 2017-02-05

I need to produce reports in the following formats. First, the number of flights each pair of passengers has taken together:

Passenger 1 ID  Passenger 2 ID  No_flights_together
    56               78               6
    12               34               8
    …                 …               …

Second, the passengers who have been on more than N flights together within a given date range (From, To):

Passenger 1 ID  Passenger 2 ID  No_Flights_Together From        To
56                  78                    6         2017-01-01  2017-03-01
12                  34                    8         2017-04-05  2017-12-01
…                   …                     …         …           …

I'm not sure how to go about it. Help would be appreciated.

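You can self-join the DataFrame on df1.passengerId < df2.passengerId, together with matching flightId and date, and then use groupBy/agg to compute count(*), min(date) and max(date) for each pair. Joining on < rather than on inequality means each pair appears only once (56, 78 but not 78, 56) and a passenger is never paired with themselves: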

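// In spark-shell these are already in scope; elsewhere, `spark` is assumed to be your SparkSession.
import org.apache.spark.sql.functions.{count, max, min}
import spark.implicits._

val df = Seq(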
  (56, 0, "2017-01-01"),
  (78, 0, "2017-01-01"),
  (12, 0, "2017-02-01"),
  (34, 0, "2017-02-01"),
  (51, 0, "2017-02-01"),
  (56, 1, "2017-01-02"),
  (78, 1, "2017-01-02"),
  (12, 1, "2017-02-02"),
  (34, 1, "2017-02-02"),
  (51, 1, "2017-02-02"),
  (56, 2, "2017-01-05"),
  (78, 2, "2017-01-05"),
  (12, 2, "2017-02-01"),
  (34, 2, "2017-02-01"),
  (51, 3, "2017-02-01")
).toDF("passengerId", "flightId", "date")

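// Pair up passengers on the same flight and date, then aggregate per pair.
// The filter keeps pairs with at least 3 shared flights; use `> 3` for strictly more than 3.
df.as("df1").join(df.as("df2"),
    $"df1.passengerId" < $"df2.passengerId" &&
    $"df1.flightId" === $"df2.flightId" &&
    $"df1.date" === $"df2.date",
    "inner"
  ).
  groupBy($"df1.passengerId", $"df2.passengerId").
  agg(count("*").as("flightsTogether"), min($"df1.date").as("from"), max($"df1.date").as("to")).
  where($"flightsTogether" >= 3).
  show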
// +-----------+-----------+---------------+----------+----------+
// |passengerId|passengerId|flightsTogether|      from|        to|
// +-----------+-----------+---------------+----------+----------+
// |         12|         34|              3|2017-02-01|2017-02-02|
// |         56|         78|              3|2017-01-01|2017-01-05|
// +-----------+-----------+---------------+----------+----------+
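
For the second report (more than N flights together within a date range), one possible way to package the same self-join is sketched below. The helper name flownTogether, its parameters, and the output column names passenger1Id and passenger2Id are assumptions made for illustration and are not part of the original answer. It also assumes the date column holds ISO yyyy-MM-dd values, so that a plain range comparison orders them correctly.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, lit, max, min}

// Hypothetical helper: pairs of passengers who shared at least `atLeastTimes`
// flights between `from` and `to` (both inclusive, as yyyy-MM-dd strings).
def flownTogether(df: DataFrame, atLeastTimes: Int, from: String, to: String): DataFrame = {
  // Filter to the date range first so the self-join sees fewer rows.
  val inRange = df.where(col("date").between(from, to))

  inRange.as("a")
    .join(inRange.as("b"),
      col("a.passengerId") < col("b.passengerId") &&   // each pair once, no self-pairs
      col("a.flightId") === col("b.flightId") &&
      col("a.date") === col("b.date"))
    // Alias the two id columns so the result does not have two columns named passengerId.
    .groupBy(col("a.passengerId").as("passenger1Id"), col("b.passengerId").as("passenger2Id"))
    .agg(
      count(lit(1)).as("flightsTogether"),
      min(col("a.date")).as("from"),
      max(col("a.date")).as("to"))
    .where(col("flightsTogether") >= atLeastTimes)
}
```

With the sample df above, flownTogether(df, 3, "2017-01-01", "2017-01-31") should keep only the pair (56, 78), since the flights of 12 and 34 all fall in February. Change >= to > in the final filter if "more than N" is meant strictly.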
