
Homemade DataFrame aggregation/dropDuplicates Spark

I want to perform a transformation on my DataFrame df so that each key appears once, and only once, in the final DataFrame.

For machine learning purposes, I don't want to introduce a bias into my dataset. Duplicate keys should never occur, but the data I get from my data source contains this "weirdness". So if several lines share the same keys, I want to be able to choose either a combination of their values (such as the mean), a string concatenation (of labels, for example), or a randomly picked set of values.

Say my DataFrame df looks like this:

+---+----+-----------+---------+
|ID1| ID2|       VAL1|     VAL2|
+---+----+-----------+---------+
|  A|   U|     PIERRE|        1|
|  A|   U|     THOMAS|        2|
|  A|   U|    MICHAEL|        3|
|  A|   V|        TOM|        2|
|  A|   V|       JACK|        3|
|  A|   W|     MICHEL|        2|
|  A|   W|     JULIEN|        3|
+---+----+-----------+---------+

I want my final DataFrame out to keep only one set of values per key, picked randomly. It could also be another type of aggregation (say, the concatenation of all values as a string), but I don't want to build a single Integer value from it; rather, I want to build new entries.

E.g. a final output could be (keeping only the first row per key):

+---+----+-----------+---------+
|ID1| ID2|       VAL1|     VAL2|
+---+----+-----------+---------+
|  A|   U|     PIERRE|        1|
|  A|   V|        TOM|        2|
|  A|   W|     MICHEL|        2|
+---+----+-----------+---------+

Another final output could be (keeping a random row per key):

+---+----+-----------+---------+
|ID1| ID2|       VAL1|     VAL2|
+---+----+-----------+---------+
|  A|   U|    MICHAEL|        3|
|  A|   V|       JACK|        3|
|  A|   W|     MICHEL|        2|
+---+----+-----------+---------+

Or, building a new set of values:

+---+----+--------------------------+----------+
|ID1| ID2|                      VAL1|      VAL2|
+---+----+--------------------------+----------+
|  A|   U| (PIERRE, THOMAS, MICHAEL)| (1, 2, 3)|
|  A|   V|               (TOM, JACK)|    (2, 3)|
|  A|   W|          (MICHEL, JULIEN)|    (2, 3)|
+---+----+--------------------------+----------+

The answer should use Spark with Scala. I also want to underline that the actual schema is far more complicated than this one, so I would like to reach a generic solution. Also, I do not want to fetch only the unique values from a single column; I want to filter out lines that share the same keys. Thanks!

EDIT: This is what I tried to do (but Row.get(colname) throws a NoSuchElementException: key not found...):

  // Build a map from field name to its (index, type) in the schema,
  // concatenate the values of `colnames` into a composite grouping key,
  // then keep one random row per key (Utils.randomElement is my own helper).
  def myDropDuplicatesRandom(df: DataFrame, colnames: Seq[String]): DataFrame = {
    val fields_map: Map[String, (Int, DataType)] =
      df.schema.fieldNames.map(fname => {
        val findex = df.schema.fieldIndex(fname)
        val ftype = df.schema.fields(findex).dataType
        (fname, (findex, ftype))
      }).toMap[String, (Int, DataType)]

    df.sparkSession.createDataFrame(
      df.rdd
        // The "key not found" NoSuchElementException is thrown by fields_map(colname)
        // whenever a name in colnames does not exactly match a schema field name.
        .map[(String, Row)](r => (colnames.map(colname => r.get(fields_map(colname)._1).toString.replace("`", "")).reduceLeft((x, y) => "" + x + y), r))
        .groupByKey()
        .map{case (x: String, y: Iterable[Row]) => Utils.randomElement(y)}
    , df.schema)
  }
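
A sketch of roughly what I am aiming for, assuming every name in colnames exactly matches a schema field name (the inline random pick just stands in for my Utils.randomElement helper):

import scala.util.Random
import org.apache.spark.sql.{DataFrame, Row}

def myDropDuplicatesRandomSketch(df: DataFrame, colnames: Seq[String]): DataFrame = {
  // Resolve the key columns to positional indices once, up front
  val keyIndices: Seq[Int] = colnames.map(df.schema.fieldIndex)

  df.sparkSession.createDataFrame(
    df.rdd
      // Build a composite string key; the separator avoids accidental collisions
      .keyBy(r => keyIndices.map(i => String.valueOf(r.get(i))).mkString("\u0001"))
      .groupByKey()
      // Keep one randomly chosen row per key
      .map { case (_, rows) =>
        val all = rows.toIndexedSeq
        all(Random.nextInt(all.size))
      },
    df.schema)
}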

Here's one approach:

val df = Seq(
  ("A", "U", "PIERRE", 1),
  ("A", "U", "THOMAS", 2),
  ("A", "U", "MICHAEL", 3),
  ("A", "V", "TOM", 2),
  ("A", "V", "JACK", 3),
  ("A", "W", "MICHEL", 2),
  ("A", "W", "JULIEN", 3)
).toDF("ID1", "ID2", "VAL1", "VAL2")

import org.apache.spark.sql.functions._

// Gather key/value column lists based on specific filtering criteria
val keyCols = df.columns.filter(_.startsWith("ID"))
val valCols = df.columns diff keyCols

// Group by keys to aggregate combined value-columns then re-expand
df.groupBy(keyCols.map(col): _*).
  agg(first(struct(valCols.map(col): _*)).as("VALS")).
  select($"ID1", $"ID2", $"VALS.*")

// +---+---+------+----+
// |ID1|ID2|  VAL1|VAL2|
// +---+---+------+----+
// |  A|  W|MICHEL|   2|
// |  A|  V|   TOM|   2|
// |  A|  U|PIERRE|   1|
// +---+---+------+----+
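
Note that first keeps an arbitrary row per group rather than a truly random one. If a genuinely random row per key is required, a minimal sketch (reusing keyCols from above) could order a window by rand():

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rand, row_number}

// Assign a random rank within each key, then keep the top-ranked row
val byRandomRow = Window.partitionBy(keyCols.map(col): _*).orderBy(rand())

df.withColumn("rn", row_number().over(byRandomRow))
  .where(col("rn") === 1)
  .drop("rn")
  .show(false)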

[UPDATE]

If I understand your expanded requirement correctly, you're looking for a generic way to transform DataFrames by key with an arbitrary agg function, like:

import org.apache.spark.sql.Column

def customAgg(keyCols: Seq[String], valCols: Seq[String], aggFcn: Column => Column) = {
  df.groupBy(keyCols.map(col): _*).
    agg(aggFcn(struct(valCols.map(col): _*)).as("VALS")).
    select($"ID1", $"ID2", $"VALS.*")
}

customAgg(keyCols, valCols, first)

I'd say that going down this path would leave you with a very limited set of applicable agg functions. While the above works for first, you would have to implement things differently for, say, collect_list/collect_set, etc. One can certainly hand-roll all the various types of agg functions, but that would likely result in unwarranted code-maintenance hassle.
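
For instance, a collect_list-based variant already needs a different final projection, since collecting the value struct yields an array of structs and VALS.* no longer applies; a rough sketch reusing keyCols and valCols from above:

import org.apache.spark.sql.functions.{col, collect_list, struct}

// Collecting the value-struct per key yields an array<struct<VAL1,VAL2>>,
// so the final step has to keep (or explode) the array instead of expanding "VALS.*"
df.groupBy(keyCols.map(col): _*)
  .agg(collect_list(struct(valCols.map(col): _*)).as("VALS"))
  .show(false)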

You can use groupBy with first and struct as below:

  import org.apache.spark.sql.functions._

  val d1 = spark.sparkContext.parallelize(Seq(
    ("A", "U", "PIERRE", 1),
    ("A", "U", "THOMAS", 2),
    ("A", "U", "MICHAEL", 3),
    ("A", "V", "TOM", 2),
    ("A", "V", "JACK", 3),
    ("A", "W", "MICHEL", 2),
    ("A", "W", "JULIEN", 3)
  )).toDF("ID1", "ID2", "VAL1", "VAL2")


  d1.groupBy("ID1", "ID2").agg(first(struct("VAL1", "VAL2")).as("val"))
    .select("ID1", "ID2", "val.*")
    .show(false)

UPDATE: If you have the keys and values as parameters, then you can do it as below.

val keys = Seq("ID1", "ID2")

val values = Seq("VAL1", "VAL2")

d1.groupBy(keys.head, keys.tail: _*)
    .agg(first(struct(values.head, values.tail: _*)).as("val"))
    .select((keys.map(col) :+ col("val.*")): _*)
    .show(false)

Output:

+---+---+------+----+
|ID1|ID2|VAL1  |VAL2|
+---+---+------+----+
|A  |W  |MICHEL|2   |
|A  |V  |TOM   |2   |
|A  |U  |PIERRE|1   |
+---+---+------+----+

I hope this helps!
