Spark Datasets groupByKey doesn't work (Java)

I am trying to use the Dataset's groupByKey method. I can't figure out the problem and can't find any working example that uses groupByKey.

Let me point out what I am looking for in a solution:

  1. I want to use groupByKey - there are a lot of examples using groupBy("key").agg(function); I know about them but don't want to use them (for educational purposes).
  2. I want to use Java - many examples use Scala; again, I don't want that.
  3. The function should preferably be written as a lambda expression.

Here is what I did:

//Inner class
public static class Bean implements Serializable {
    private static final long serialVersionUID = 1L;
    private String k;
    private int something;

    //no-arg constructor: required by Encoders.bean for deserialization
    public Bean() {}

    public Bean(String name, int value) {
        k = name;
        something = value;
    }

    public String getK() {return k;}
    public int getSomething() {return something;}

    public void setK(String k) {this.k = k;}
    public void setSomething(int something) {this.something = something;}
}
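
For completeness, the sqlContext used below is assumed to already exist; a minimal setup for it (my addition, assuming Spark 2.x) could look like this:

import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("groupByKey-example")
    .master("local[*]")
    .getOrCreate();
//SQLContext is kept here to match the question's code; SparkSession itself
//also offers createDataFrame directly
SQLContext sqlContext = spark.sqlContext();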

//usage
List<Bean> debugData = new ArrayList<Bean>();
debugData.add(new Bean("Arnold", 18));
debugData.add(new Bean("Bob", 7));
debugData.add(new Bean("Bob", 13));
debugData.add(new Bean("Bob", 15));
debugData.add(new Bean("Alice", 27));
Dataset<Row> df = sqlContext.createDataFrame(debugData, Bean.class);
df.groupByKey(row -> {new Bean(row.getString(0), row.getInt(1));}, Encoders.bean(Bean.class)); //doesn't compile

The errors I am getting:

  1. Ambiguous method call - the IDE warns that both Function1 and MapFunction match.
  2. getString and getInt cannot be resolved.
  3. I can't show/print the result.

Using Java 8 lambda

df.groupByKey(row -> {
            return new Bean(row.getString(0), row.getInt(1));
        }, Encoders.bean(Bean.class));

Using MapFunction

df.groupByKey(new MapFunction<Row, Bean>() {
            @Override
            public Bean call(Row row) throws Exception {
                return new Bean(row.getString(0), row.getInt(1));
            }
        }, Encoders.bean(Bean.class));

This error arises because groupByKey has two overloaded implementations: one takes a MapFunction as its first argument, the other takes a Scala Function1 (whose implicit Encoder parameter surfaces as an ordinary second argument when called from Java). Your lambda can be converted to either of them, so you must declare explicitly which one you intend.
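
For context, here is a rough sketch (my own illustration based on Spark's published API, not text from the original answer) of the two overloads the compiler is choosing between:

//Roughly how javac sees the two overloads of Dataset<T>.groupByKey
//(the Scala version's implicit Encoder becomes an ordinary second parameter):
//  <K> KeyValueGroupedDataset<K, T> groupByKey(scala.Function1<T, K> func, Encoder<K> evidence);
//  <K> KeyValueGroupedDataset<K, T> groupByKey(MapFunction<T, K> func, Encoder<K> encoder);
//A bare lambda converts to either functional interface, so the call is ambiguous;
//and because no overload is chosen, the compiler cannot infer the type of row,
//which is why getString and getInt "cannot be resolved".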

Casting is an easy solution (note that the cast must apply to the whole lambda, not to the Bean it returns):

df.groupByKey(
    (MapFunction<Row, Bean>) row -> new Bean(row.getString(0), row.getInt(1)),
    Encoders.bean(Bean.class));
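
Building on that, here is a minimal end-to-end sketch (my own addition, not part of the original answer) that also prints the grouped result, covering point 3 of the question. It groups by the name column as a plain String key and sums the values per name:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.MapGroupsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.KeyValueGroupedDataset;
import org.apache.spark.sql.Row;

//Group by the name column (index 0, the Bean's k property), keyed as a String
KeyValueGroupedDataset<String, Row> grouped = df.groupByKey(
    (MapFunction<Row, String>) row -> row.getString(0),
    Encoders.STRING());

//Collapse each group to a "name: sum" line
Dataset<String> summed = grouped.mapGroups(
    (MapGroupsFunction<String, Row, String>) (name, rows) -> {
        int sum = 0;
        while (rows.hasNext()) {
            sum += rows.next().getInt(1);
        }
        return name + ": " + sum;
    },
    Encoders.STRING());

summed.show();

With the debug data above, this should print one line per name, e.g. Bob: 35.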
