How to sort a JavaRDD<Row> by multiple fields and keep only specific data in Java Spark

Question

I have input data of type JavaRDD<Row>. Each Row has two fields.

[
  {"fieldName":"requestId", "fieldType":"String"}, 
  {"fieldName":"price", "fieldType":"double"}
]

The requestId and price values may be duplicated across many Rows. My goal is to keep, for each requestId, only the Row with the maximum price. Any method is fine, even one that does not use sorting.

For example, the input looks like this:

76044601-8029-4e09-9708-41dd125ae4bb    1676.304091136485
76044601-8029-4e09-9708-41dd125ae4bb    3898.9987591932413
ad0acb4a-100d-4624-b863-fcf275ce28db    7518.603722172683
76044601-8029-4e09-9708-41dd125ae4bb    3308.4421575701463
26f639bc-2041-435c-86da-73b997c0cc64    1737.7186292370193
beeb7fc1-2a2d-4943-8237-c281ee7c9617    4941.882928279789
26f639bc-2041-435c-86da-73b997c0cc64    1710.328581775302

The output should look like this (the output order does not matter):

76044601-8029-4e09-9708-41dd125ae4bb    3898.9987591932413
ad0acb4a-100d-4624-b863-fcf275ce28db    7518.603722172683
26f639bc-2041-435c-86da-73b997c0cc64    1737.7186292370193
beeb7fc1-2a2d-4943-8237-c281ee7c9617    4941.882928279789

Candidate method:

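import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.Row;
import scala.Tuple2;

// Key each Row by requestId, keep the Row with the larger price per key,
// then drop the key again.
JavaRDD<Row> javaRDD = dataFrame.toJavaRDD().mapToPair(new PairFunction<Row, String, Row>() {
    @Override
    public Tuple2<String, Row> call(Row row) {
        // Assumes the column is named "requestId" and holds a String, as in
        // the schema above; the name lookup is case-sensitive.
        String key = row.<String>getAs("requestId");
        return new Tuple2<String, Row>(key, row);
    }
}).reduceByKey(new Function2<Row, Row, Row>() {
    @Override
    public Row call(Row row1, Row row2) throws Exception {
        // Assumes "price" is stored as a double, as in the schema above.
        double rs1 = row1.<Double>getAs("price");
        double rs2 = row2.<Double>getAs("price");
        // On a tie, row1 is kept.
        return rs1 < rs2 ? row2 : row1;
    }
}).map(new Function<Tuple2<String, Row>, Row>() {
    @Override
    public Row call(Tuple2<String, Row> tuple) {
        return tuple._2;
    }
});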

Answer 1

First, convert the raw data into a JavaRDD.

Then use the mapToPair function to turn the data into key-value pairs (key: requestId, value: price).

Next, use the reduceByKey function to choose the maximum price as the value for each key.

The resulting JavaRDD is the one you want.
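
The mapToPair and reduceByKey steps described above amount to a per-key reduction with max. The following is a minimal sketch of that logic using plain Java collections instead of Spark, so it runs without a cluster. The class name MaxPriceByKey and the method maxPricePerRequest are hypothetical names chosen for illustration; the sample values come from the question.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MaxPriceByKey {

    // Hypothetical helper: imitates mapToPair + reduceByKey on an in-memory list.
    // Each entry plays the role of a (requestId, price) pair.
    public static Map<String, Double> maxPricePerRequest(
            List<? extends Map.Entry<String, Double>> pairs) {
        Map<String, Double> maxByKey = new HashMap<>();
        for (Map.Entry<String, Double> pair : pairs) {
            // merge() stores the price for a new key, or combines it with the
            // current value via Double::max, as the reduce function would.
            maxByKey.merge(pair.getKey(), pair.getValue(), Double::max);
        }
        return maxByKey;
    }

    public static void main(String[] args) {
        List<SimpleEntry<String, Double>> pairs = Arrays.asList(
            new SimpleEntry<String, Double>("76044601-8029-4e09-9708-41dd125ae4bb", 1676.304091136485),
            new SimpleEntry<String, Double>("76044601-8029-4e09-9708-41dd125ae4bb", 3898.9987591932413),
            new SimpleEntry<String, Double>("ad0acb4a-100d-4624-b863-fcf275ce28db", 7518.603722172683),
            new SimpleEntry<String, Double>("76044601-8029-4e09-9708-41dd125ae4bb", 3308.4421575701463),
            new SimpleEntry<String, Double>("26f639bc-2041-435c-86da-73b997c0cc64", 1737.7186292370193),
            new SimpleEntry<String, Double>("beeb7fc1-2a2d-4943-8237-c281ee7c9617", 4941.882928279789),
            new SimpleEntry<String, Double>("26f639bc-2041-435c-86da-73b997c0cc64", 1710.328581775302));

        // Print one line per requestId with its maximum price.
        for (Map.Entry<String, Double> e : maxPricePerRequest(pairs).entrySet()) {
            System.out.println(e.getKey() + "    " + e.getValue());
        }
    }
}
```

Because only one running value per key is kept, the reduction never needs all prices of a key at once. The iteration order of a HashMap is unspecified, which is acceptable here since the question says the output order does not matter.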

Answer 2

You should use groupByKey instead of reduceByKey, and then sort the grouped result.
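
The groupByKey suggestion above can be pictured as two steps: collect every price that shares a requestId, then sort each group and take the largest. This is a rough sketch of that idea in plain Java collections, not Spark code; the class name GroupThenSort and the method maxPriceBySorting are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupThenSort {

    // Hypothetical helper: groups prices by requestId, sorts each group in
    // descending order, and keeps the first (largest) price.
    public static Map<String, Double> maxPriceBySorting(
            List<? extends Map.Entry<String, Double>> pairs) {
        // Step 1: gather all prices per requestId (the "group by key" part).
        Map<String, List<Double>> grouped = new HashMap<>();
        for (Map.Entry<String, Double> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }

        // Step 2: sort each group from high to low and keep its head.
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, List<Double>> group : grouped.entrySet()) {
            List<Double> prices = group.getValue();
            prices.sort(Comparator.reverseOrder());
            result.put(group.getKey(), prices.get(0));
        }
        return result;
    }

    public static void main(String[] args) {
        List<SimpleEntry<String, Double>> pairs = Arrays.asList(
            new SimpleEntry<String, Double>("26f639bc-2041-435c-86da-73b997c0cc64", 1737.7186292370193),
            new SimpleEntry<String, Double>("beeb7fc1-2a2d-4943-8237-c281ee7c9617", 4941.882928279789),
            new SimpleEntry<String, Double>("26f639bc-2041-435c-86da-73b997c0cc64", 1710.328581775302));
        System.out.println(maxPriceBySorting(pairs));
    }
}
```

In this sketch every price of a key is held in a list before one is chosen, so it does more work than a running max when only the largest value is needed. Sorting becomes useful if the top few prices per requestId are wanted rather than just one.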

Answer 3

There is a simple way to achieve this.

Just use groupBy followed by max, and you will get the result without converting to a JavaRDD.

df.groupBy("requestId").max("price").show();

Test

For input:

{"requestId": "1", "price": 10}
{"requestId": "1", "price": 15}
{"requestId": "1", "price": 19}
{"requestId": "2", "price": 20}
{"requestId": "2", "price": 21}
{"requestId": "2", "price": 26}
{"requestId": "3", "price": 30}
{"requestId": "3", "price": 38}

I got:

+---------+----------+
|requestId|max(price)|
+---------+----------+
|        1|        19|
|        2|        26|
|        3|        38|
+---------+----------+
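
To make the grouping semantics of this test concrete, here is a small sketch that applies the same group-then-max aggregation to the test input using Java streams rather than a DataFrame. The class GroupByMax, the nested PriceRow type, and the method maxPrice are all hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class GroupByMax {

    // Hypothetical stand-in for one input record of the test.
    public static final class PriceRow {
        final String requestId;
        final int price;

        public PriceRow(String requestId, int price) {
            this.requestId = requestId;
            this.price = price;
        }
    }

    // Groups rows by requestId and keeps the maximum price per group.
    // A TreeMap is used only so that keys print in sorted order.
    public static Map<String, Integer> maxPrice(List<PriceRow> rows) {
        return rows.stream().collect(Collectors.toMap(
            r -> r.requestId,   // grouping key
            r -> r.price,       // value to aggregate
            Integer::max,       // how two values with the same key are combined
            TreeMap::new));
    }

    public static void main(String[] args) {
        List<PriceRow> rows = Arrays.asList(
            new PriceRow("1", 10), new PriceRow("1", 15), new PriceRow("1", 19),
            new PriceRow("2", 20), new PriceRow("2", 21), new PriceRow("2", 26),
            new PriceRow("3", 30), new PriceRow("3", 38));

        for (Map.Entry<String, Integer> e : maxPrice(rows).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

For the test input this yields 19, 26, and 38 for requestId 1, 2, and 3, matching the table above. The result contains only the key and the aggregated value, which is sufficient here because the Row has no fields other than requestId and price.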

3 answers

Answer 1
0 Accepted 2016-07-15 09:14:54

Answer 2
0 2016-07-15 09:54:59

Answer 3
0 2016-07-15 12:18:24
