Filtering a KeyValueGroupedDataset in Spark

Question

I have a typed Dataset of a custom class and call groupByKey on it, which returns a KeyValueGroupedDataset. I want to filter this grouped dataset, but KeyValueGroupedDataset has no filter method. So my question is: how can I filter this type of dataset? (A Java solution is needed. Spark version: 2.3.1.)

Sample data:

"id":1,"fname":"Gale","lname":"Willmett","email":"gwillmett0@nhs.uk","gender":"Female"
"id":2,"fname":"Chantalle","lname":"Wilcher","email":"cwilcher1@blinklist.com","gender":"Female"
"id":3,"fname":"Polly","lname":"Grandisson","email":"pgrandisson2@linkedin.com","gender":"Female"
"id":3,"fname":"Moshe","lname":"Pink","email":"mpink3@twitter.com","gender":"Male"
"id":2,"fname":"Yorke","lname":"Ginnelly","email":"yginnelly4@apple.com","gender":"Male"

What I did:

    // "/path" is a placeholder for the Parquet location
    Dataset<Person> peopleDS = spark.read().format("parquet").load("/path").as(Encoders.bean(Person.class));
    KeyValueGroupedDataset<String, Person> KVDS = peopleDS.groupByKey((MapFunction<Person, String>) f -> f.getGender(), Encoders.STRING());
    // How can I filter KVDS on the id field?

Update 1 (using flatMapGroups):

    // f is the group key, k is the iterator over that group's rows (returned unchanged)
    Dataset<Person> persons = KVDS.flatMapGroups((FlatMapGroupsFunction<String, Person, Person>) (f, k) -> k, Encoders.bean(Person.class));

Update 2 (using mapGroups):

    Dataset<Person> peopleMap = KVDS.mapGroups((MapGroupsFunction<String, Person, Person>) (f, g) -> {
        while (g.hasNext()) {
            // What can I do here?
        }
    }, Encoders.bean(Person.class));

Update 3: I want to keep only those groups whose number of distinct ids is greater than 1. For example, in the picture below I want just the Female group, because its number of distinct ids is greater than 1 (the first field is id; the others are fname, lname, email and gender).

Update 4: I did what I want with an RDD, but I want to do exactly this part of the code with a Dataset:

    List<Tuple2<String, Iterable<Person>>> f = PersonRDD
            .mapToPair(s -> new Tuple2<>(s.getGender(), s)).groupByKey()
            // keep only groups with more than one distinct id
            .filter(t -> ((Collection<Person>) t._2()).stream().mapToInt(e -> e.getId()).distinct().count() > 1)
            .collect();

Answer 1

Grouping is meant for aggregation; the KeyValueGroupedDataset class has functions such as agg. If you apply an aggregation function, for example count, you get a Dataset back, and filter is available on it.
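The aggregate-then-filter idea can be illustrated without Spark. The following is only a plain Java collections sketch of the logic, not Spark API: it computes the distinct ids per gender, filters that aggregate, and then keeps the rows of the qualifying genders. The class `AggregateThenFilterSketch`, its simplified `Person`, and the sample rows are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

class AggregateThenFilterSketch {

    // Simplified stand-in for the asker's Person bean (assumption).
    public static class Person {
        private final int id;
        private final String gender;

        public Person(int id, String gender) {
            this.id = id;
            this.gender = gender;
        }

        public int getId() { return id; }
        public String getGender() { return gender; }
    }

    // Step 1 (aggregate): collect the distinct ids seen for each gender.
    public static Map<String, Set<Integer>> distinctIdsByGender(List<Person> people) {
        return people.stream().collect(
                Collectors.groupingBy(Person::getGender,
                        Collectors.mapping(Person::getId, Collectors.toSet())));
    }

    // Step 2 (filter the aggregate): genders with more than one distinct id.
    public static Set<String> gendersWithMultipleIds(List<Person> people) {
        return distinctIdsByGender(people).entrySet().stream()
                .filter(e -> e.getValue().size() > 1)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    // Step 3: keep only the rows whose gender passed the filter.
    public static List<Person> keepQualifying(List<Person> people) {
        Set<String> keep = gendersWithMultipleIds(people);
        return people.stream()
                .filter(p -> keep.contains(p.getGender()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up rows: Female has ids 1 and 2, Male has id 4 twice.
        List<Person> people = Arrays.asList(
                new Person(1, "Female"), new Person(2, "Female"),
                new Person(4, "Male"), new Person(4, "Male"));
        System.out.println(gendersWithMultipleIds(people));
        System.out.println(keepQualifying(people).size());
    }
}
```

In Spark the same three steps would correspond to an aggregation per key, a filter on the aggregated Dataset, and a join back to the original rows.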

A groupBy without an aggregation function looks strange; other functions, for example distinct, can be used instead.

Filtering example with FlatMapGroupsFunction:

    Dataset<Person> filtered = KVDS
        .flatMapGroups(
            (FlatMapGroupsFunction<String, Person, Person>) (f, k) -> {
                // f is the group key, k iterates over that group's rows
                List<Person> result = new ArrayList<>();
                while (k.hasNext()) {
                    Person value = k.next();
                    // filter condition here (placeholder: keep non-null rows)
                    if (value != null) {
                        result.add(value);
                    }
                }
                return result.iterator();
            },
            Encoders.bean(Person.class));
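For the group-level condition from Update 3 (keep a group only when it has more than one distinct id), the per-group logic can be written as an ordinary Java method and called from the flatMapGroups lambda. This is a minimal sketch using only java.util: `GroupFilterSketch`, `keepIfMultipleIds`, the simplified `Person`, and the sample rows are hypothetical stand-ins, not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

class GroupFilterSketch {

    // Simplified stand-in for the asker's Person bean (assumption).
    public static class Person {
        private final int id;
        private final String fname;
        private final String gender;

        public Person(int id, String fname, String gender) {
            this.id = id;
            this.fname = fname;
            this.gender = gender;
        }

        public int getId() { return id; }
        public String getFname() { return fname; }
        public String getGender() { return gender; }
    }

    // Returns all rows of the group if it contains more than one distinct id,
    // otherwise an empty iterator (the whole group is dropped).
    public static Iterator<Person> keepIfMultipleIds(Iterator<Person> group) {
        List<Person> members = new ArrayList<>();
        Set<Integer> ids = new HashSet<>();
        while (group.hasNext()) {
            Person p = group.next();
            members.add(p);      // buffer the row; the iterator can be read only once
            ids.add(p.getId());  // track distinct ids
        }
        return ids.size() > 1 ? members.iterator() : Collections.<Person>emptyIterator();
    }

    public static void main(String[] args) {
        // Made-up groups: the first has ids 1, 2, 3; the second has id 4 twice.
        List<Person> female = Arrays.asList(
                new Person(1, "Gale", "Female"),
                new Person(2, "Chantalle", "Female"),
                new Person(3, "Polly", "Female"));
        List<Person> male = Arrays.asList(
                new Person(4, "Moshe", "Male"),
                new Person(4, "Yorke", "Male"));
        System.out.println(keepIfMultipleIds(female.iterator()).hasNext());
        System.out.println(keepIfMultipleIds(male.iterator()).hasNext());
    }
}
```

With the real Person bean, the lambda passed to flatMapGroups could then be `(gender, people) -> GroupFilterSketch.keepIfMultipleIds(people)`. Note that this buffers each group in memory, just like the example above.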

Answer 2

Why don't you filter on id before grouping? groupByKey is an expensive operation, so filtering first should be faster.

If you really want to group first, you may then have to use .flatMapGroups with an identity function.

I'm not sure about the Java code, but the Scala version would be something like the following:

    peopleDS
      .groupByKey(_.gender)
      // yourCondition is a placeholder for a Person => Boolean predicate
      .flatMapGroups { case (gender, persons) => persons.filter(yourCondition) }

But again, you should filter first :), especially since your id field is already available before grouping.

Answer 1
0 2018-10-01 13:36:28

Answer 2
0 2018-10-01 14:31:29