
Spark Datasets groupByKey doesn't work (Java)

I am trying to use Dataset's groupByKey method. I can't figure out the problem and can't find any working example that uses groupByKey.

So let me point out what I am looking for in a solution:

  1. I want to use groupByKey - there are a lot of examples using groupBy("key").agg(function); I know them but don't want to use them (educational purposes)
  2. I want to use Java - many examples use Scala, which again I don't want
  3. The function should preferably be written as a lambda expression

Here is what I did:

//Inner class
public static class Bean implements Serializable {
    private static final long serialVersionUID = 1L;
    private String k;
    private int something;

    // Encoders.bean(Bean.class) requires a public no-arg constructor
    public Bean() {}

    public Bean(String name, int value) {
        k = name;
        something = value;
    }

    public String getK() {return k;}
    public int getSomething() {return something;}

    public void setK(String k) {this.k = k;}
    public void setSomething(int something) {this.something = something;}
}

//usage
List<Bean> debugData = new ArrayList<Bean>();
debugData.add(new Bean("Arnold", 18));
debugData.add(new Bean("Bob", 7));
debugData.add(new Bean("Bob", 13));
debugData.add(new Bean("Bob", 15));
debugData.add(new Bean("Alice", 27));
Dataset<Row> df = sqlContext.createDataFrame(debugData, Bean.class);
df.groupByKey(row -> {new Bean(row.getString(0), row.getInt(1));}, Encoders.bean(Bean.class)); //doesn't compile

The errors I am getting:

  1. Ambiguous method call - the IDE shows a warning that both Function1 and MapFunction match
  2. getString and getInt cannot be resolved
  3. I can't show/print the result

Using a Java 8 lambda:

df.groupByKey(row -> {
    return new Bean(row.getString(0), row.getInt(1));
}, Encoders.bean(Bean.class));

Using MapFunction:

df.groupByKey(new MapFunction<Row, Bean>() {
    @Override
    public Bean call(Row row) throws Exception {
        return new Bean(row.getString(0), row.getInt(1));
    }
}, Encoders.bean(Bean.class));

This error arises because groupByKey has two overloaded implementations: one takes a MapFunction as its first argument, the other a Function1. Your lambda can be converted to either of them, so you have to state explicitly which one you intend. A cast is an easy solution:

df.groupByKey((MapFunction<Row, Bean>) row -> new Bean(row.getString(0), row.getInt(1)),
    Encoders.bean(Bean.class));
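
For completeness, here is a minimal end-to-end sketch that also covers the third point above (showing/printing the result). This is an illustration under assumptions, not code from the question: it assumes Spark 2.x with a local SparkSession, groups by the String key rather than by a whole Bean, and sums the values per key with mapGroups. The class name GroupByKeyDemo and the session setup are made up for the example; the Bean class is the one defined in the question.

import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.MapGroupsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GroupByKeyDemo {
    public static void main(String[] args) {
        // Local session just for this demo (hypothetical setup)
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("groupByKeyDemo")
                .getOrCreate();

        List<Bean> debugData = Arrays.asList(
                new Bean("Arnold", 18),
                new Bean("Bob", 7),
                new Bean("Bob", 13),
                new Bean("Bob", 15),
                new Bean("Alice", 27));

        Dataset<Row> df = spark.createDataFrame(debugData, Bean.class);

        // The casts select the Java-friendly overloads, resolving the ambiguity;
        // grouping by the String key keeps the key encoder simple.
        Dataset<String> sums = df
                .groupByKey((MapFunction<Row, String>) row -> row.<String>getAs("k"),
                        Encoders.STRING())
                .mapGroups((MapGroupsFunction<String, Row, String>) (key, rows) -> {
                    int sum = 0;
                    while (rows.hasNext()) {
                        sum += rows.next().<Integer>getAs("something");
                    }
                    return key + " -> " + sum;
                }, Encoders.STRING());

        sums.show(false); // prints e.g. "Bob -> 35"
        spark.stop();
    }
}

An alternative to the inline cast is to give the lambda an explicit type first, e.g. MapFunction<Row, Bean> toBean = row -> new Bean(row.getString(0), row.getInt(1)); and pass toBean to groupByKey. Like the anonymous class in the question, this removes the ambiguity without a cast.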
