How to perform complicated manipulations on Scala datasets
I am fairly new to Scala, and coming from a SQL and pandas background, the Dataset objects in Scala are giving me a bit of trouble.
I have a dataset that looks like the following:
+-------+------+
|car_num|colour|
+-------+------+
|    145|     c|
|    132|     p|
|    104|     u|
|    110|     c|
|    110|     f|
|    113|     c|
|    115|     c|
|     11|     i|
|    117|     s|
|    118|     a|
+-------+------+
I have loaded it as a Dataset using a case class that looks like the following:
case class carDS(carNum: String, Colour: String)
Each car_num is unique to a car, and many of the cars have multiple entries. The colour column refers to the colour the car was painted.
I would like to know how to add a column that gives the total number of paint jobs a car has had without being green (g), for example.
So far I have tried this:
carDS
  .map(x => (x.carNum, x.Colour))
  .groupBy("_1")
  .count()
  .orderBy($"count".desc)
  .show()
But I believe this just gives me a count column of the number of times the car was painted, not the longest run of paint jobs the car had without being green.
I think I might need to use a function in my query like the following:
def colourrun(sq: String): Int = {
  println(sq)
  sq.mkString(" ")
    .split("g")
    .filter(_.nonEmpty)
    .map(_.trim)
    .map(s => s.split(" ").length)
    .max
}
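For what it's worth, when fed one car's colours as a single separator-free string (an assumption about the intended input, e.g. "rbgboyrg" for car 102), the helper does compute the longest non-green run. A quick sketch of it in action, with the debug println omitted:

```scala
// Same logic as the colourrun helper above, minus the println.
def colourrun(sq: String): Int =
  sq.mkString(" ")                   // "rbgboyrg" -> "r b g b o y r g"
    .split("g")                      // cut at every green: Array("r b ", " b o y r ")
    .filter(_.nonEmpty)
    .map(_.trim)
    .map(s => s.split(" ").length)   // run lengths between greens: 2 and 4
    .max

val longest = colourrun("rbgboyrg")  // longest run without green: 4
```

Note it would throw on an all-green string, since `max` over an empty array is undefined.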
but I am unsure where it should go.
Ultimately, if car 102 had been painted r, b, g, b, o, y, r, g, I would want the count column to give 4 as the answer.
How would I do this? Thanks.
Here's one approach: group the paint jobs for a given car into monotonically numbered groups separated by paint jobs of colour "g", followed by a couple of groupBy/aggs to get the max count of paint jobs between the "g" paint jobs. (Note that a timestamp column is added to ensure a deterministic ordering of the rows in the dataset.)
val ds = Seq(
  ("102", "r", 1), ("102", "b", 2), ("102", "g", 3), ("102", "b", 4), ("102", "o", 5), ("102", "y", 6), ("102", "r", 7), ("102", "g", 8),
  ("145", "c", 1), ("145", "g", 2), ("145", "b", 3), ("145", "r", 4), ("145", "g", 5), ("145", "c", 6), ("145", "g", 7)
).toDF("car_num", "colour", "timestamp").as[(String, String, Long)]
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val win = Window.partitionBy("car_num").orderBy("timestamp")

ds.
  // running count of "g" rows, so each stretch between greens gets its own group number
  withColumn("group", sum(when($"colour" === "g", 1).otherwise(0)).over(win)).
  groupBy("car_num", "group").agg(
    // every group after the first starts with the "g" row itself, so subtract it
    when($"group" === 0, count("group")).otherwise(count("group") - 1).as("count")
  ).
  groupBy("car_num").agg(max("count").as("max_between_g")).
  show
// +-------+-------------+
// |car_num|max_between_g|
// +-------+-------------+
// | 102| 4|
// | 145| 2|
// +-------+-------------+
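To see why the window sum buckets the rows correctly, the running count that builds the group column can be reproduced in plain Scala for car 102 (a sketch independent of Spark; the list mirrors the timestamp order):

```scala
// Colours for car 102 in timestamp order.
val colours = List("r", "b", "g", "b", "o", "y", "r", "g")

// Cumulative count of "g" rows seen so far (inclusive of the current row),
// mirroring sum(when($"colour" === "g", 1).otherwise(0)).over(win).
val groups = colours.scanLeft(0)((acc, c) => if (c == "g") acc + 1 else acc).tail
// groups == List(0, 0, 1, 1, 1, 1, 1, 2)
```

Group 0 holds the 2 rows before the first green; group 1 holds 5 rows but starts with the green itself, hence the `count - 1` in the aggregation; the max over the per-group counts gives 4.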
An alternative to using the DataFrame API is to apply groupByKey to the Dataset, followed by mapGroups, like below:
ds.
  map(c => (c._1, c._2)).  // (car_num, colour) pairs from the tuple Dataset
  groupByKey(_._1).
  mapGroups { case (k, iter) =>
    // fold over the colours, tracking (current run length, max run so far)
    val maxTuple = iter.map(_._2).foldLeft((0, 0)) { case ((cnt, mx), c) =>
      if (c == "g") (0, math.max(cnt, mx)) else (cnt + 1, mx)
    }
    (k, maxTuple._2)
  }.
  show
// +---+---+
// | _1| _2|
// +---+---+
// |102| 4|
// |145| 2|
// +---+---+