Homemade DataFrame aggregation/dropDuplicates in Spark
I want to perform a transformation on my DataFrame df so that each key appears once and only once in the final DataFrame.
For machine learning purposes, I don't want my dataset to be biased. This should never occur, but the data I get from my data source contains this "weirdness". So when I have rows with the same keys, I want to be able to choose either a combination of the two (like the mean value), a string concatenation (labels, for example), or a random set of values.
Say my DataFrame df looks like this:
+---+----+-----------+---------+
|ID1| ID2| VAL1| VAL2|
+---+----+-----------+---------+
| A| U| PIERRE| 1|
| A| U| THOMAS| 2|
| A| U| MICHAEL| 3|
| A| V| TOM| 2|
| A| V| JACK| 3|
| A| W| MICHEL| 2|
| A| W| JULIEN| 3|
+---+----+-----------+---------+
I want my final DataFrame out to keep only one set of values per key, chosen randomly. It could be another type of aggregation (say the concatenation of all values as a string), but I just don't want to build an Integer value from it; rather, I want to build new entries.
E.g. a final output could be (keeping only the first row per key):
+---+----+-----------+---------+
|ID1| ID2| VAL1| VAL2|
+---+----+-----------+---------+
| A| U| PIERRE| 1|
| A| V| TOM| 2|
| A| W| MICHEL| 2|
+---+----+-----------+---------+
Another final output could be (keeping a random row per key):
+---+----+-----------+---------+
|ID1| ID2| VAL1| VAL2|
+---+----+-----------+---------+
| A| U| MICHAEL| 3|
| A| V| JACK| 3|
| A| W| MICHEL| 2|
+---+----+-----------+---------+
Or, building a new set of values:
+---+----+--------------------------+----------+
|ID1| ID2| VAL1| VAL2|
+---+----+--------------------------+----------+
| A| U| (PIERRE, THOMAS, MICHAEL)| (1, 2, 3)|
| A| V| (TOM, JACK)| (2, 3)|
| A| W| (MICHEL, JULIEN)| (2, 3)|
+---+----+--------------------------+----------+
The answer should use Spark with Scala. I also want to underline that the actual schema is way more complicated than that, and I would like to reach a generic solution. Also, I do not want to just fetch the unique values of one column, but to filter out rows that share the same keys.
Thanks!
EDIT: This is what I tried to do (but Row.get(colname) throws a NoSuchElementException: key not found...):
def myDropDuplicatesRandom(df: DataFrame, colnames: Seq[String]): DataFrame = {
  // Map each field name to its (index, type) in the schema
  val fields_map: Map[String, (Int, DataType)] =
    df.schema.fieldNames.map(fname => {
      val findex = df.schema.fieldIndex(fname)
      val ftype = df.schema.fields(findex).dataType
      (fname, (findex, ftype))
    }).toMap[String, (Int, DataType)]

  df.sparkSession.createDataFrame(
    df.rdd
      // build a string key from the key columns, keep the full row as the value
      .map[(String, Row)](r => (colnames.map(colname => r.get(fields_map(colname)._1).toString.replace("`", "")).reduceLeft((x, y) => "" + x + y), r))
      .groupByKey()
      // keep one random row per key
      .map{case (x: String, y: Iterable[Row]) => Utils.randomElement(y)},
    df.schema)
}
Here's one approach:
import spark.implicits._  // assuming a SparkSession named spark, as in spark-shell

val df = Seq(
  ("A", "U", "PIERRE", 1),
  ("A", "U", "THOMAS", 2),
  ("A", "U", "MICHAEL", 3),
  ("A", "V", "TOM", 2),
  ("A", "V", "JACK", 3),
  ("A", "W", "MICHEL", 2),
  ("A", "W", "JULIEN", 3)
).toDF("ID1", "ID2", "VAL1", "VAL2")
import org.apache.spark.sql.functions._

// Gather key/value column lists based on specific filtering criteria
val keyCols = df.columns.filter(_.startsWith("ID"))
val valCols = df.columns diff keyCols

// Group by keys to aggregate combined value-columns, then re-expand
df.groupBy(keyCols.map(col): _*).
  agg(first(struct(valCols.map(col): _*)).as("VALS")).
  select($"ID1", $"ID2", $"VALS.*")
// +---+---+------+----+
// |ID1|ID2| VAL1|VAL2|
// +---+---+------+----+
// | A| W|MICHEL| 2|
// | A| V| TOM| 2|
// | A| U|PIERRE| 1|
// +---+---+------+----+
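As an aside (this sketch is my addition, not part of the original answer): if you specifically want a random row per key rather than the first one, a common pattern is a window ranked by rand(), assuming the same df and keyCols as above:

import org.apache.spark.sql.expressions.Window

// Rank the rows of each key group in a random order, then keep only rank 1
val byKeyRandom = Window.partitionBy(keyCols.map(col): _*).orderBy(rand())

df.withColumn("rn", row_number().over(byKeyRandom)).
  filter($"rn" === 1).
  drop("rn")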
[UPDATE]
If I understand your expanded requirement correctly, you're looking for a generic way to transform dataframes by keys with an arbitrary agg function, like:
import org.apache.spark.sql.Column

def customAgg(keyCols: Seq[String], valCols: Seq[String], aggFcn: Column => Column) = {
  df.groupBy(keyCols.map(col): _*).
    agg(aggFcn(struct(valCols.map(col): _*)).as("VALS")).
    select($"ID1", $"ID2", $"VALS.*")
}
customAgg(keyCols, valCols, first)
I'd say that going down this path would result in a very limited set of applicable agg functions. While the above works for first, you would have to implement things differently for, say, collect_list/collect_set, etc. One can certainly hand-roll all the various types of agg functions, but it would likely result in unwarranted code-maintenance hassle.
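For example (a hedged sketch of my own, not from the original answer), a collect_list-based variant can't reuse the select($"VALS.*") re-expansion: aggregating a struct with collect_list yields an array of structs, so each field has to be pulled out as an array column instead:

// assumes the same df, keyCols and valCols as above
df.groupBy(keyCols.map(col): _*).
  agg(collect_list(struct(valCols.map(col): _*)).as("VALS")).
  select($"ID1", $"ID2", $"VALS.VAL1".as("VAL1"), $"VALS.VAL2".as("VAL2"))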
You can use groupBy with first and struct as below:
import org.apache.spark.sql.functions._
import spark.implicits._  // assuming a SparkSession named spark, as in spark-shell

val d1 = spark.sparkContext.parallelize(Seq(
  ("A", "U", "PIERRE", 1),
  ("A", "U", "THOMAS", 2),
  ("A", "U", "MICHAEL", 3),
  ("A", "V", "TOM", 2),
  ("A", "V", "JACK", 3),
  ("A", "W", "MICHEL", 2),
  ("A", "W", "JULIEN", 3)
)).toDF("ID1", "ID2", "VAL1", "VAL2")
d1.groupBy("ID1", "ID2").agg(first(struct("VAL1", "VAL2")).as("val"))
.select("ID1", "ID2", "val.*")
.show(false)
UPDATE: If you have the keys and values as parameters, then you can use the following.
val keys = Seq("ID1", "ID2")
val values = Seq("VAL1", "VAL2")

d1.groupBy(keys.head, keys.tail: _*)
  .agg(first(struct(values.head, values.tail: _*)).as("val"))
  .select(keys.head, (keys.tail :+ "val.*"): _*)  // keep the key columns first, then expand the struct
  .show(false)
Output:
+---+---+------+----+
|ID1|ID2|VAL1 |VAL2|
+---+---+------+----+
|A |W |MICHEL|2 |
|A |V |TOM |2 |
|A |U |PIERRE|1 |
+---+---+------+----+
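Since the question also mentions concatenating the labels instead of keeping a single row, a possible variant (my sketch, not part of the original answer) swaps first for concat_ws over collect_list, reusing the same keys and values:

// Concatenate all values per key into a single string per column.
// The cast to string is needed because VAL2 is an integer column.
val aggExprs = values.map(v => concat_ws(", ", collect_list(col(v).cast("string"))).as(v))

d1.groupBy(keys.head, keys.tail: _*)
  .agg(aggExprs.head, aggExprs.tail: _*)
  .show(false)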
I hope this helps!