
Optimal implementation of function in terms of performance

I have a list of items and a map that stores information about each product and its items' data. There are around 150k items in the DB and around 200k products (each product has approximately 1000 to 2000 items mapped to it).

I need a function that counts the number of products each item appears in. This is the function I have implemented:

public Map<Integer, Integer> getProductsNumberForItem(List<Item> itemsList,
        Map<Integer, Map<Item, Integer>> itemsAmount) {
    Map<Integer, Integer> result = new HashMap<>();
    for (Item i : itemsList) {
        int count = 0;
        for (Map<Item, Integer> entry : itemsAmount.values()) {
            if (entry.containsKey(i)) {
                count++;
            }
        }
        result.put(i.getID(), count);
    }
    return result;
}

It works fine on my test DB, which has a small amount of data, but when I run it on real data it takes too much time (for example, it has already been running for an hour and still has not finished). From a logical point of view it's clear that I am performing too many operations, but I'm not sure how to optimize it.

Any suggestion is appreciated.

You have two ways:

  • Most efficient: do the computation in a query executed in the database.
    With a count() aggregate and a group by clause you should get a much better result, as the whole processing will be performed by the DBMS, which is designed and optimized for this.

  • Less efficient, but you may give it a try: retrieve the data as you do now and use multi-threading.
    With Java 8 parallelStream(), you could maybe get an acceptable result without the hassle of handling synchronization yourself.

The best option is to delegate this computation to the DB, which avoids transferring all the data to your application server.

If this is not an option, you can certainly improve your current algorithm. Right now, for each item in the list, you are looping through all products; that's quadratic cost (number of items × number of products). With roughly 150k items and 200k products, that is about 3 × 10^10 containsKey calls.

You could do it like this (using streams, since the reasoning is easier to follow in my opinion and they also allow adding some improvements; the same can be achieved without them):

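// Flatten every product's item set into one stream (an item appears once per product that contains it).
Stream<Item> productsItemsStream = itemsAmount.values().stream().flatMap(p -> p.keySet().stream());
// Count how many products each item appears in.
Map<Item, Long> countByItemFound = productsItemsStream.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
// Look up each requested item by the Item key itself; items found in no product get 0.
Map<Integer, Integer> result = itemsList.stream().collect(Collectors.toMap(Item::getID, i -> countByItemFound.getOrDefault(i, 0L).intValue()));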

With this approach you do one full pass over the products' items and then another pass over the items list. That's linear cost.
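
For readers who prefer plain loops, here is a minimal sketch of the same two-pass idea without streams. The Item class below is a hypothetical stand-in for the one in the question; the sketch assumes the real Item implements equals and hashCode consistently (here based on the ID), which containsKey in the original code already relies on. The class name ItemProductCounter is also just an illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ItemProductCounter {

    // Hypothetical stand-in for the question's Item class.
    // equals/hashCode use the ID so an Item can serve as a HashMap key.
    static final class Item {
        private final int id;
        Item(int id) { this.id = id; }
        int getID() { return id; }
        @Override public boolean equals(Object o) {
            return o instanceof Item && ((Item) o).id == id;
        }
        @Override public int hashCode() { return Integer.hashCode(id); }
    }

    static Map<Integer, Integer> getProductsNumberForItem(
            List<Item> itemsList, Map<Integer, Map<Item, Integer>> itemsAmount) {
        // Pass 1: visit every product once and bump a counter for each item it contains.
        Map<Item, Integer> countByItem = new HashMap<>();
        for (Map<Item, Integer> productItems : itemsAmount.values()) {
            for (Item item : productItems.keySet()) {
                countByItem.merge(item, 1, Integer::sum);
            }
        }
        // Pass 2: look up each requested item; items in no product get 0.
        Map<Integer, Integer> result = new HashMap<>();
        for (Item item : itemsList) {
            result.put(item.getID(), countByItem.getOrDefault(item, 0));
        }
        return result;
    }

    public static void main(String[] args) {
        Item a = new Item(1), b = new Item(2), c = new Item(3);
        Map<Integer, Map<Item, Integer>> itemsAmount = new HashMap<>();
        itemsAmount.put(100, Map.of(a, 5, b, 1)); // product 100 contains items 1 and 2
        itemsAmount.put(200, Map.of(a, 2));       // product 200 contains item 1
        System.out.println(getProductsNumberForItem(List.of(a, b, c), itemsAmount));
    }
}
```

Like the original loop, result.put simply overwrites if two list entries share an ID, whereas the two-argument Collectors.toMap in the stream version throws on duplicate keys.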

Specific to streams, you could try enabling parallelism (using parallelStream in my solution), but a big performance increase is not guaranteed; it depends on several factors. I would first see how the proposed solution performs and, if needed, profile it with and without parallelStream in your scenario.
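
A possible parallel variant is sketched below, in case profiling shows the sequential version is still too slow. It assumes the same hypothetical Item stand-in as before (equals and hashCode based on the ID); the names ParallelItemProductCounter and countParallel are illustrative only. It swaps groupingBy for Collectors.groupingByConcurrent so that the worker threads accumulate into one shared concurrent map; whether that actually helps is something to measure rather than assume.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class ParallelItemProductCounter {

    // Hypothetical stand-in for the question's Item class.
    static final class Item {
        private final int id;
        Item(int id) { this.id = id; }
        int getID() { return id; }
        @Override public boolean equals(Object o) {
            return o instanceof Item && ((Item) o).id == id;
        }
        @Override public int hashCode() { return Integer.hashCode(id); }
    }

    static Map<Integer, Integer> countParallel(
            List<Item> itemsList, Map<Integer, Map<Item, Integer>> itemsAmount) {
        // Products are processed in parallel; counts go into a single concurrent map.
        ConcurrentMap<Item, Long> countByItem = itemsAmount.values().parallelStream()
                .flatMap(productItems -> productItems.keySet().stream())
                .collect(Collectors.groupingByConcurrent(
                        Function.identity(), Collectors.counting()));

        // The lookup pass is cheap, so it stays sequential.
        // The merge function keeps the first value if two list entries share an ID.
        return itemsList.stream().collect(Collectors.toMap(
                Item::getID,
                item -> countByItem.getOrDefault(item, 0L).intValue(),
                (first, second) -> first));
    }

    public static void main(String[] args) {
        Item a = new Item(1), b = new Item(2), c = new Item(3);
        Map<Integer, Map<Item, Integer>> itemsAmount = Map.of(
                100, Map.of(a, 5, b, 1),
                200, Map.of(a, 2));
        System.out.println(countParallel(List.of(a, b, c), itemsAmount));
    }
}
```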

