
Sorting an RDD by its values after using groupByKey

I have a JavaPairRDD built as follows:

JavaPairRDD<String, Iterable<Row>> rdd = mydataset
    .orderBy("orderfield1", "orderfield2")
    .javaRDD()
    .mapToPair(row -> new Tuple2<>(row.getAs("id").toString(), row))
    .groupByKey();

Because groupByKey() does not preserve ordering, the preceding orderBy has no effect on the grouped values. I want to sort each Iterable<Row> by some of the fields from the dataset.

You could copy the Iterable into a List and then sort that list, as shown below. I assume that your sorting field is called x and that it is of type String, but you can adapt that to your specific case.

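import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.sql.Row;
import scala.Tuple2;

// Name of the column to sort by; assumed here to hold String values.
String sortingField = "x";
JavaPairRDD<String, List<Row>> rdd = mydataset
    .javaRDD()
    .mapToPair(row -> new Tuple2<>(row.getAs("id").toString(), row))
    .groupByKey()
    .mapValues(it -> {
        // Copy each group's Iterable into a List so it can be sorted in place.
        List<Row> rows = new ArrayList<>();
        it.forEach(rows::add);
        rows.sort(
            (Row a, Row b) -> a.<String>getAs(sortingField).compareTo(b.<String>getAs(sortingField))
        );
        return rows;
    });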

Note that this is much simpler to write in Scala:

// getAs is given an explicit type parameter; a bare getAs("id") would infer Nothing in Scala.
val rdd = mydataset
    .rdd
    .map(row => (row.getAs[Any]("id").toString, row))
    .groupByKey
    .mapValues(_.toSeq.sortBy(_.getAs[String]("x")))
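
The question orders by two fields (orderfield1 and orderfield2), while the snippets above sort on a single field. Below is a minimal sketch of the same copy-then-sort step with a two-level comparator, written in plain Java so it runs without Spark. The Rec class and the sortValues helper are hypothetical stand-ins (Rec plays the role of Row), and the field types (a String and an int) are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortGroupedValues {
    // Hypothetical stand-in for a Spark Row carrying two sort fields.
    static final class Rec {
        final String field1;
        final int field2;

        Rec(String field1, int field2) {
            this.field1 = field1;
            this.field2 = field2;
        }

        @Override
        public String toString() {
            return field1 + ":" + field2;
        }
    }

    // Copies one group's values into a List, then sorts by field1 and breaks ties on field2.
    static List<Rec> sortValues(Iterable<Rec> values) {
        List<Rec> rows = new ArrayList<>();
        values.forEach(rows::add);
        rows.sort(Comparator.comparing((Rec r) -> r.field1).thenComparingInt(r -> r.field2));
        return rows;
    }

    public static void main(String[] args) {
        Iterable<Rec> group = Arrays.asList(new Rec("b", 1), new Rec("a", 2), new Rec("a", 1));
        System.out.println(sortValues(group)); // prints [a:1, a:2, b:1]
    }
}
```

Inside mapValues the same pattern would presumably apply to Row, for example Comparator.comparing((Row r) -> r.<String>getAs("orderfield1")).thenComparing(r -> r.<String>getAs("orderfield2")), assuming both columns hold String values.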

