Filtering a KeyValueGroupedDataset in Spark

Question

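I have a typed Dataset of a custom class and call the groupByKey method on it. As you know, this produces a KeyValueGroupedDataset. I want to filter this new dataset, but this type has no filter method. So my question is: how can I filter a dataset of this type? (A Java solution is needed. Spark version: 2.3.1.)

Sample data: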

"id":1,"fname":"Gale","lname":"Willmett","email":"gwillmett0@nhs.uk","gender":"Female"
"id":2,"fname":"Chantalle","lname":"Wilcher","email":"cwilcher1@blinklist.com","gender":"Female"
"id":3,"fname":"Polly","lname":"Grandisson","email":"pgrandisson2@linkedin.com","gender":"Female"
"id":3,"fname":"Moshe","lname":"Pink","email":"mpink3@twitter.com","gender":"Male"
"id":2,"fname":"Yorke","lname":"Ginnelly","email":"yginnelly4@apple.com","gender":"Male"

What I did:

    Dataset<Person> peopleDS = spark.read().format("parquet").load("/path").as(Encoders.bean(Person.class));
    KeyValueGroupedDataset<String, Person> KVDS = peopleDS.groupByKey((MapFunction<Person, String>) f -> f.getGender(), Encoders.STRING());
    // How can I filter on KVDS's id field?

Update1 (using flatMapGroups):

Dataset<Person> persons = KVDS.flatMapGroups((FlatMapGroupsFunction <String,Person,Person>) (f,k) -> (Iterator<Person>) k ,  Encoders.bean(Person.class));

Update2 (using mapGroups):

Dataset<Person> peopleMap = KVDS.mapGroups((MapGroupsFunction<String, Person, Person>) (f, g) -> {
    while (g.hasNext()) {
        // What can I do here?
    }
}, Encoders.bean(Person.class));

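Update3: I want to keep only the groups that contain more than one distinct ID. For example, in the picture I attached I want to select only the Female group, because it has more than one distinct ID (the first field is the ID; the others are fname, lname, email and gender).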

Update4: I did what I want with an "RDD", but I would like to do this part of the code with a "Dataset":

List<Tuple2<String, Iterable<Person>>> f = PersonRDD
        .mapToPair(s -> new Tuple2<>(s.getGender(), s)).groupByKey()
        .filter(t -> ((Collection<Person>) t._2()).stream().mapToInt(e -> e.getId()).distinct().count() > 1)
        .collect();

Answer 1

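Grouping is meant for aggregate functions; you can find functions such as "agg" in the "KeyValueGroupedDataset" class. If you apply an aggregate function, e.g. "count", you will get a "Dataset", and the "filter" function will be available.

A "groupBy" without an aggregate function looks odd; if no aggregation is needed, another function such as "distinct" can be used instead.

Filtering example using "FlatMapGroupsFunction":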

    Dataset<Person> filtered = KVDS.flatMapGroups(
        (FlatMapGroupsFunction<String, Person, Person>) (f, k) -> {
            // f is the group key, k iterates over the rows of that group
            List<Person> result = new ArrayList<>();
            while (k.hasNext()) {
                Person value = k.next();
                // filter condition here
                if (value != null) {
                    result.add(value);
                }
            }
            return result.iterator();
        },
        Encoders.bean(Person.class));
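
For the whole-group condition from Update3 (keep a group only if it has more than one distinct ID), the rows probably have to be buffered, because the iterator handed to flatMapGroups can only be traversed once. Below is a minimal sketch of that logic as a plain Java helper, so it can be tried without a cluster. The class name GroupFilters and the method keepIfMoreDistinctIdsThan are made up for illustration, and the Spark wiring in the trailing comment assumes the KVDS variable and a Person.getId() getter as in the question:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.function.ToIntFunction;

// Hypothetical helper (not part of Spark): decides per group whether to keep its rows.
public final class GroupFilters {
    private GroupFilters() {}

    /**
     * Returns every row of the group if the group contains more than
     * `threshold` distinct ids, otherwise an empty iterator.
     */
    public static <T> Iterator<T> keepIfMoreDistinctIdsThan(
            Iterator<T> rows, ToIntFunction<T> idOf, int threshold) {
        List<T> buffered = new ArrayList<>();   // the iterator is single-pass, so copy the rows
        Set<Integer> ids = new HashSet<>();     // distinct ids seen in this group
        while (rows.hasNext()) {
            T row = rows.next();
            buffered.add(row);
            ids.add(idOf.applyAsInt(row));
        }
        return ids.size() > threshold ? buffered.iterator() : Collections.emptyIterator();
    }
}

// Assumed wiring into Spark, shown as a comment because it needs Spark on the classpath:
//
// Dataset<Person> kept = KVDS.flatMapGroups(
//     (FlatMapGroupsFunction<String, Person, Person>) (gender, people) ->
//         GroupFilters.keepIfMoreDistinctIdsThan(people, Person::getId, 1),
//     Encoders.bean(Person.class));
```

Note that this holds an entire group in memory at once, so it may not suit very large groups.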

Answer 2

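Why not filter on id before grouping? groupByKey is an expensive operation, so filtering first should be faster.

If you really want to group first, you will probably have to use .flatMapGroups with an identity function.

I'm not sure about the Java code, but the Scala version would look like this: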

peopleDS
  .groupByKey(_.gender)
  // yourCondition is a placeholder for the actual predicate on a Person
  .flatMapGroups { case (gender, persons) => persons.filter(p => yourCondition(p)) }

But again, you should filter first :). Especially since your ID field is already available before grouping.

Answer 1
0 2018-10-01 13:36:28

Answer 2
0 2018-10-01 14:31:29
