Most efficient way to get the top k items in an ArrayList in Java

I am trying to find the fastest and most efficient way to get the top K items from an ArrayList of objects, based on a custom Comparable implementation.

During my research, some people suggested that I use a max/min heap, which Java provides as PriorityQueue. However, the problem is that I don't know how to apply that to an ArrayList of objects.

Here is my object class:

public class PropertyRecord {

    private long id;
    private String address, firstName, lastName, email, ownerAddress;
    private LocalDate dateSold;
    private BigDecimal price;


    public PropertyRecord(long id, String address, String firstName, String lastName, String email, String ownerAddress, LocalDate dateSold, BigDecimal price) {

        this.id = id;
        this.address = address;
        this.firstName = firstName;
        this.lastName = lastName;
        this.email = email;
        this.ownerAddress = ownerAddress;
        this.dateSold = dateSold;
        this.price = price;

    }
 //getters and setters...
}

I want to get the top K items based on price. I have written a method (see below) that takes the ArrayList and K and uses the Stream API, but I know it is not the most efficient way to do it, because it sorts the whole list even though I only want the first K items. So instead of paying the O(n log n) cost of a full sort, I want something cheaper, such as O(n log k).

//return the top n properties based on sale price.
    public List<PropertyRecord> getTopProperties(List<PropertyRecord> properties, int n){

       //using StreamAPI
       return properties.stream()
               .sorted((p1, p2) -> p2.getPrice().compareTo(p1.getPrice()))
               .limit(n)
               .collect(Collectors.toList());

    }

Any help, please?

Guava contains a TopKSelector class that can do exactly this.

In the latest Guava version, this functionality is now exposed as Comparators.greatest().

However, if you're not locked into using an ArrayList for storage, you're probably better off using a PriorityQueue, which keeps its elements heap-ordered so that the highest-priority element is always at the head.
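As a rough illustration of the PriorityQueue idea applied to an existing list, here is a minimal sketch of a bounded min-heap that keeps only the k greatest items seen so far, which costs O(n log k) instead of a full sort. The class name TopK and the method topK are hypothetical names chosen for this example; they are not part of the JDK or Guava.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper for illustration only; the names are assumptions.
public final class TopK {
    private TopK() {}

    /** Returns the k greatest items according to cmp, largest first. */
    public static <T> List<T> topK(List<T> items, int k, Comparator<? super T> cmp) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Min-heap by cmp: the head is the smallest of the candidates kept so far.
        PriorityQueue<T> heap = new PriorityQueue<>(cmp);
        for (T item : items) {
            if (heap.size() < k) {
                heap.offer(item);
            } else if (cmp.compare(item, heap.peek()) > 0) {
                heap.poll();       // evict the current smallest candidate
                heap.offer(item);  // keep the larger newcomer
            }
        }
        // Heap iteration order is unspecified, so sort the (at most k) survivors.
        List<T> result = new ArrayList<>(heap);
        result.sort(cmp.reversed());
        return result;
    }
}
```

For the question's class, a call would presumably look like TopK.topK(properties, n, Comparator.comparing(PropertyRecord::getPrice)), assuming the getPrice() getter that the Stream version already uses.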

There are a few possible options for calculating the top K in Java, so which is the most efficient?

package com.example;

import com.google.common.collect.Ordering;

import java.util.*;
import java.util.stream.Collectors;

public class TopKBenchmark {
    public static void main(String[] args) {
        int inputListSize = 500000;
        int topK = 1000;
        int runCount = 100;
        List<Integer> inputList = new ArrayList<>(inputListSize);
        Random rand = new Random();
        rand.setSeed(System.currentTimeMillis());
        for (int i = 0; i < inputListSize; i++) {
            inputList.add(rand.nextInt(100000));
        }

        List<Integer> result1 = null, result2 = null, result3 = null, result4 = null;

        // method 1: stream and limit
        for (int i = 0; i < runCount; i++) {
            result1 = inputList.stream().sorted().limit(topK).collect(Collectors.toList());
        }

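        // method 2: sort all
        // NOTE: this sorts inputList in place, so every later iteration (and
        // methods 3 and 4 below) runs on already-sorted input.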
        for (int i = 0; i < runCount; i++) {
            Collections.sort(inputList);
            result2 = inputList.subList(0, topK);
        }

        // method3: guava: TopKSelector
        Ordering<Integer> ordering = Ordering.natural();
        for (int i = 0; i < runCount; i++) {
            result3 = ordering.leastOf(inputList, topK);
        }

        // method4: PQ
        for (int i = 0; i < runCount; i++) {
            PriorityQueue<Integer> priorityQueue = new PriorityQueue<>(Collections.reverseOrder());
            for (Integer val: inputList) {
                if (priorityQueue.size() < topK || val < priorityQueue.peek()) {
                    priorityQueue.offer(val);
                }
                if (priorityQueue.size() > topK) {
                    priorityQueue.poll();
                }
            }

            result4 = new ArrayList<Integer>(priorityQueue);
            Collections.sort(result4);
        }

        if (result1.size() != result2.size() ||
                result2.size() != result3.size() ||
                result3.size() != result4.size()) {
            throw new RuntimeException();
        }
        for (int i = 0; i < result1.size(); i++) {
            if (!result1.get(i).equals(result2.get(i)) ||
                    !result2.get(i).equals(result3.get(i)) ||
                    !result3.get(i).equals(result4.get(i))) {
                throw new RuntimeException();
            }
        }
    }
}

I tried the following inputListSize and topK combinations:

  • inputListSize=100000, topK=5000
  • 1000, 1000
  • 5000, 1000
  • 50000, 1000
  • 500000, 1000

Here is the benchmark result (the smaller the better):

[Image: benchmark result]

Measured using Spot Profiler for Java and Kotlin.

NOTE: a series such as 1.4s, 23ms, 83ms, 719ms, 8.8s gives the timings for the first, second, ... combination listed above, in that order.

As Peter mentioned in the comments, this is not a strict benchmark. It would be best to run the benchmark case by case.
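One way to run each case separately is sketched below: every run gets a fresh copy of the input, so a method that sorts in place cannot influence later runs, and only the method itself is timed. IsolatedTiming and averageNanos are hypothetical names for this example, and the sketch still ignores JIT warm-up, so a harness such as JMH would be the stricter option.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

// Hypothetical timing helper for illustration; not a rigorous benchmark.
public final class IsolatedTiming {
    private IsolatedTiming() {}

    /** Runs task on a fresh copy of input each time; returns average nanoseconds per run. */
    public static <R> long averageNanos(List<Integer> input, int runs,
                                        Function<List<Integer>, R> task) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long total = 0;
        for (int i = 0; i < runs; i++) {
            List<Integer> copy = new ArrayList<>(input); // copying is not timed
            long start = System.nanoTime();
            R result = task.apply(copy);
            total += System.nanoTime() - start;
            if (result == null) {
                throw new IllegalStateException("task returned null");
            }
        }
        return total / runs;
    }

    public static void main(String[] args) {
        int topK = 1000;
        List<Integer> input = new ArrayList<>();
        Random rand = new Random(42); // fixed seed so runs are repeatable
        for (int i = 0; i < 500000; i++) {
            input.add(rand.nextInt(100000));
        }
        // Time the "sort all" approach on its own; input itself stays unsorted.
        long sortAll = averageNanos(input, 20, copy -> {
            Collections.sort(copy);
            return new ArrayList<>(copy.subList(0, topK));
        });
        System.out.println("sort all: " + sortAll / 1_000_000.0 + " ms per run");
    }
}
```

The other approaches from the benchmark can be passed in the same way, one per invocation of the program, so that each measurement starts from the same unsorted data.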
