
Spark: How do I find the passengers who have been on more than 3 flights together?

I have a dataset as follows:

passengerId, flightId, from, to, date
56,          0,        cg,   ir, 2017-01-01
78,          0,        cg,   ir, 2017-01-01
12,          0,        cg,   ir, 2017-02-01
34,          0,        cg,   ir, 2017-02-01
51,          0,        cg,   ir, 2017-02-01

56,          1,        ir,   uk, 2017-01-02
78,          1,        ir,   uk, 2017-01-02
12,          1,        ir,   uk, 2017-02-02
34,          1,        ir,   uk, 2017-02-02
51,          1,        ir,   uk, 2017-02-02

56,          2,        uk,   in, 2017-01-05
78,          2,        uk,   in, 2017-01-05
12,          2,        uk,   in, 2017-02-05
34,          2,        uk,   in, 2017-02-05
51,          3,        uk,   in, 2017-02-05

I need to produce a report in the following format:

Passenger 1 ID  Passenger 2 ID  No_flights_together
    56               78               6
    12               34               8
    …                 …               …

I also need to find the passengers who have been on more than N flights together within a date range:

Passenger 1 ID  Passenger 2 ID  No_Flights_Together From        To
56                  78                    6         2017-01-01  2017-03-01
12                  34                    8         2017-04-05  2017-12-01
…                   …                     …         …           …

I don't know how to go about it. Any help would be appreciated.

You can self-join on df1.passengerId < df2.passengerId together with matching flightId and date, then use groupBy/agg to compute the required count(*), min(date) and max(date):

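// spark-shell pre-imports these; in a standalone application add them
// yourself, where `spark` is your SparkSession.
import org.apache.spark.sql.functions.{count, max, min}
import spark.implicits._

val df = Seq(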
  (56, 0, "2017-01-01"),
  (78, 0, "2017-01-01"),
  (12, 0, "2017-02-01"),
  (34, 0, "2017-02-01"),
  (51, 0, "2017-02-01"),
  (56, 1, "2017-01-02"),
  (78, 1, "2017-01-02"),
  (12, 1, "2017-02-02"),
  (34, 1, "2017-02-02"),
  (51, 1, "2017-02-02"),
  (56, 2, "2017-01-05"),
  (78, 2, "2017-01-05"),
  (12, 2, "2017-02-01"),
  (34, 2, "2017-02-01"),
  (51, 3, "2017-02-01")
).toDF("passengerId", "flightId", "date")

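// Self-join: pair each passenger with every higher-numbered passenger on the
// same flight and date. Using `<` keeps each pair once and drops self-pairs.
df.as("df1").join(df.as("df2"),
    $"df1.passengerId" < $"df2.passengerId" &&
    $"df1.flightId" === $"df2.flightId" &&
    $"df1.date" === $"df2.date",
    "inner"
  ).
  groupBy($"df1.passengerId", $"df2.passengerId").
  agg(
    count("*").as("flightsTogether"),  // number of shared flights
    min($"df1.date").as("from"),       // earliest shared flight date
    max($"df1.date").as("to")          // latest shared flight date
  ).
  where($"flightsTogether" >= 3).      // sample pairs share exactly 3 flights; use `> 3` for strictly more
  show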
// +-----------+-----------+---------------+----------+----------+
// |passengerId|passengerId|flightsTogether|      from|        to|
// +-----------+-----------+---------------+----------+----------+
// |         12|         34|              3|2017-02-01|2017-02-02|
// |         56|         78|              3|2017-01-01|2017-01-05|
// +-----------+-----------+---------------+----------+----------+
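
For the second report (N flights within a date range), the same self-join can be wrapped in a function. The following is only a sketch under some assumptions: the names `FlownTogether`, `flownTogether`, `atLeastNTimes`, `passenger1Id` and `passenger2Id` are made up for illustration, the `date` column is assumed to hold `yyyy-MM-dd` strings (or dates), and the range is treated as inclusive on both ends. It keeps the `>=` comparison used above; switch to `>` if "more than N" should be strict.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, lit, max, min, to_date}

object FlownTogether {
  // Hypothetical helper: pairs of passengers who shared at least
  // `atLeastNTimes` flights dated between `from` and `to` (inclusive).
  def flownTogether(df: DataFrame, atLeastNTimes: Int, from: String, to: String): DataFrame = {
    // Filter to the date window before joining so the self-join sees fewer rows.
    val inRange = df
      .withColumn("date", to_date(col("date")))
      .where(col("date").between(to_date(lit(from)), to_date(lit(to))))

    inRange.as("df1")
      .join(inRange.as("df2"),
        col("df1.passengerId") < col("df2.passengerId") &&
        col("df1.flightId") === col("df2.flightId") &&
        col("df1.date") === col("df2.date"))
      .groupBy(
        col("df1.passengerId").as("passenger1Id"),
        col("df2.passengerId").as("passenger2Id"))
      .agg(
        count("*").as("flightsTogether"),
        min(col("df1.date")).as("from"),
        max(col("df1.date")).as("to"))
      .where(col("flightsTogether") >= atLeastNTimes)
  }
}
```

With the sample `df` above, `FlownTogether.flownTogether(df, 3, "2017-01-01", "2017-01-31")` should keep only the pair (56, 78), since the other passengers' flights all fall in February.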
