
Spark: How do I find the passengers who have been on more than 3 flights together

I have a dataset like the following:

passengerId, flightId, from, to, date
56,          0,        cg,   ir, 2017-01-01
78,          0,        cg,   ir, 2017-01-01
12,          0,        cg,   ir, 2017-02-01
34,          0,        cg,   ir, 2017-02-01
51,          0,        cg,   ir, 2017-02-01

56,          1,        ir,   uk, 2017-01-02
78,          1,        ir,   uk, 2017-01-02
12,          1,        ir,   uk, 2017-02-02
34,          1,        ir,   uk, 2017-02-02
51,          1,        ir,   uk, 2017-02-02

56,          2,        uk,   in, 2017-01-05
78,          2,        uk,   in, 2017-01-05
12,          2,        uk,   in, 2017-02-05
34,          2,        uk,   in, 2017-02-05
51,          3,        uk,   in, 2017-02-05

I need to present a report in the following format:

Passenger 1 ID  Passenger 2 ID  No_flights_together
    56               78               6
    12               34               8
    …                 …               …

Find the passengers who have been on more than N flights together within a given date range:

Passenger 1 ID  Passenger 2 ID  No_Flights_Together From        To
56                  78                    6         2017-01-01  2017-03-01
12                  34                    8         2017-04-05  2017-12-01
…                   …                     …         …           …

I'm not sure how to go about it. Help would be appreciated.

You can self-join the DataFrame on df1.passengerId < df2.passengerId along with equal flightId and date, then compute the necessary count(*), min(date) and max(date) using groupBy/agg:

import org.apache.spark.sql.functions._
import spark.implicits._  // spark: the active SparkSession (predefined in spark-shell)

val df = Seq(
  (56, 0, "2017-01-01"),
  (78, 0, "2017-01-01"),
  (12, 0, "2017-02-01"),
  (34, 0, "2017-02-01"),
  (51, 0, "2017-02-01"),
  (56, 1, "2017-01-02"),
  (78, 1, "2017-01-02"),
  (12, 1, "2017-02-02"),
  (34, 1, "2017-02-02"),
  (51, 1, "2017-02-02"),
  (56, 2, "2017-01-05"),
  (78, 2, "2017-01-05"),
  (12, 2, "2017-02-01"),
  (34, 2, "2017-02-01"),
  (51, 3, "2017-02-01")
).toDF("passengerId", "flightId", "date")

df.as("df1").join(df.as("df2"),
    $"df1.passengerId" < $"df2.passengerId" &&
    $"df1.flightId" === $"df2.flightId" &&
    $"df1.date" === $"df2.date",
    "inner"
  ).
  groupBy($"df1.passengerId", $"df2.passengerId").
  agg(count("*").as("flightsTogether"), min($"df1.date").as("from"), max($"df1.date").as("to")).
  where($"flightsTogether" >= 3).
  show
// +-----------+-----------+---------------+----------+----------+
// |passengerId|passengerId|flightsTogether|      from|        to|
// +-----------+-----------+---------------+----------+----------+
// |         12|         34|              3|2017-02-01|2017-02-02|
// |         56|         78|              3|2017-01-01|2017-01-05|
// +-----------+-----------+---------------+----------+----------+
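The same pair-counting logic can be sketched without Spark, using plain Scala collections on the sample rows above. This is only an illustration of what the self-join computes; FlightsTogether, Leg, pairsTogether and the atLeast threshold are names introduced here, not part of the answer's code:

```scala
// For every unordered passenger pair, count the flights they shared (same
// flightId and date), keep pairs meeting a threshold, and report the first
// and last shared date.
object FlightsTogether {
  case class Leg(passengerId: Int, flightId: Int, date: String)

  def pairsTogether(legs: Seq[Leg], atLeast: Int): Seq[(Int, Int, Int, String, String)] =
    legs
      .groupBy(l => (l.flightId, l.date))       // passengers on the same flight, same date
      .values
      .flatMap { group =>
        for {
          a <- group
          b <- group
          if a.passengerId < b.passengerId      // each unordered pair counted once
        } yield ((a.passengerId, b.passengerId), a.date)
      }
      .groupBy(_._1)                            // group the shared dates by pair
      .collect { case ((p1, p2), hits) if hits.size >= atLeast =>
        val dates = hits.map(_._2)
        (p1, p2, hits.size, dates.min, dates.max)
      }
      .toSeq
      .sorted

  def main(args: Array[String]): Unit = {
    val legs = Seq(
      Leg(56, 0, "2017-01-01"), Leg(78, 0, "2017-01-01"),
      Leg(12, 0, "2017-02-01"), Leg(34, 0, "2017-02-01"), Leg(51, 0, "2017-02-01"),
      Leg(56, 1, "2017-01-02"), Leg(78, 1, "2017-01-02"),
      Leg(12, 1, "2017-02-02"), Leg(34, 1, "2017-02-02"), Leg(51, 1, "2017-02-02"),
      Leg(56, 2, "2017-01-05"), Leg(78, 2, "2017-01-05"),
      Leg(12, 2, "2017-02-01"), Leg(34, 2, "2017-02-01"), Leg(51, 3, "2017-02-01")
    )
    // Matches the Spark output above: (12,34) and (56,78), 3 flights each.
    pairsTogether(legs, atLeast = 3).foreach(println)
  }
}
```

Restricting to a date range (the second report format) is then just a filter before the grouping, e.g. legs.filter(l => l.date >= from && l.date <= to), which works because ISO yyyy-MM-dd dates compare correctly as strings; in the Spark version the equivalent is a where on df1.date before or after the join.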
