How to compare two huge List<String> in Java?

My application generates two big lists (up to 3.5 million string records). I need the best and fastest way to compare them. Currently I am doing it like this:

List list1 = ListUtils.subtract(sourceDbResults, hiveResults);
List list2 = ListUtils.subtract(hiveResults, sourceDbResults);

But this method is really expensive on memory, as I can see from jconsole, and sometimes the process even gets stuck on it. Any good solutions or ideas?

Element positions/order in the lists are always the same, so I don't need to deal with that. After comparing, I need to know whether the lists are the same and, if they are not, to get the differences between them. Subtract works perfectly for small lists.

Given that you've said your two lists are already sorted, they can be compared in O(N) time, which is much faster than your current solution that uses ListUtils. The following method does this using an algorithm similar to the textbook one for merging two sorted lists.

import java.util.*;

public class CompareSortedLists {
    public static void main(String[] args) {
        List<Integer> sourceDbResults = Arrays.asList(1, 2, 3, 4, 5, 8);
        List<Integer> hiveResults = Arrays.asList(2, 3, 6, 7);
        List<Integer> inSourceDb_notInHive = new ArrayList<>();
        List<Integer> inHive_notInSourceDb = new ArrayList<>();

        compareSortedLists(
                sourceDbResults, hiveResults,
                inSourceDb_notInHive, inHive_notInSourceDb);

        // Run with `java -ea` so that these assertions are actually evaluated.
        assert inSourceDb_notInHive.equals(Arrays.asList(1, 4, 5, 8));
        assert inHive_notInSourceDb.equals(Arrays.asList(6, 7));
    }

    /**
     * Compares two sorted lists (or other iterable collections in ascending order).
     * Adds to onlyInList1 every element of list1 that is not in list2, and
     * conversely for onlyInList2. The caller must ensure that the two input lists
     * are already sorted and should pass empty, writable collections as
     * onlyInList1 and onlyInList2. Null elements are not supported, because null
     * is used internally to mark the end of each list.
     */
    public static <T extends Comparable<? super T>> void compareSortedLists(
            Iterable<T> list1, Iterable<T> list2,
            Collection<T> onlyInList1, Collection<T> onlyInList2) {
        Iterator<T> it1 = list1.iterator();
        Iterator<T> it2 = list2.iterator();
        T e1 = it1.hasNext() ? it1.next() : null;
        T e2 = it2.hasNext() ? it2.next() : null;
        while (e1 != null || e2 != null) {
            if (e2 == null) {  // No more elements in list2, some remaining in list1
                onlyInList1.add(e1);
                e1 = it1.hasNext() ? it1.next() : null;
            }
            else if (e1 == null) {  // No more elements in list1, some remaining in list2
                onlyInList2.add(e2);
                e2 = it2.hasNext() ? it2.next() : null;
            }
            else {
                int comp = e1.compareTo(e2);
                if (comp < 0) {
                    onlyInList1.add(e1);
                    e1 = it1.hasNext() ? it1.next() : null;
                }
                else if (comp > 0) {
                    onlyInList2.add(e2);
                    e2 = it2.hasNext() ? it2.next() : null;
                }
                else /* comp == 0 */ {
                    e1 = it1.hasNext() ? it1.next() : null;
                    e2 = it2.hasNext() ? it2.next() : null;
                }
            }
        }
    }
}
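Since the question also asks whether the lists are the same at all, a cheap check before computing any differences may be worthwhile. The following is only a sketch, assuming both lists really are in the same order as the question states; `QuickEqualityCheck` and `sameContents` are hypothetical names, not part of any library.

```java
import java.util.Arrays;
import java.util.List;

public class QuickEqualityCheck {
    /**
     * Hypothetical helper: true when both lists hold equal elements in the same order.
     */
    public static boolean sameContents(List<String> list1, List<String> list2) {
        // Lists of different sizes can never be equal, so check that first
        // and only then compare element by element.
        return list1.size() == list2.size() && list1.equals(list2);
    }

    public static void main(String[] args) {
        List<String> sourceDbResults = Arrays.asList("a", "b", "c");
        List<String> hiveResults = Arrays.asList("a", "b", "c");
        if (sameContents(sourceDbResults, hiveResults)) {
            System.out.println("Lists are identical, no diff needed");
        } else {
            System.out.println("Lists differ, compute the differences");
        }
    }
}
```

List.equals compares the elements pairwise in order and allocates no extra collections, so the two difference lists only need to be built when this check fails.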

The compareSortedLists method above uses no external libraries and can be used with any version of Java from 6 upwards. If you use a PeekingIterator, such as the one from Apache Commons Collections or Guava, or write your own, you can make the method simpler, especially if you also use Java 8. The version below calls the constructor of the Apache Commons Collections 4 class; with Guava, obtain the iterator from Iterators.peekingIterator(...) instead:

public static <T extends Comparable<? super T>> void compareSortedLists(
        Iterable<T> list1, Iterable<T> list2,
        Collection<T> onlyInList1, Collection<T> onlyInList2) {
    PeekingIterator<T> it1 = new PeekingIterator<>(list1.iterator());
    PeekingIterator<T> it2 = new PeekingIterator<>(list2.iterator());
    while (it1.hasNext() && it2.hasNext()) {
        int comp = it1.peek().compareTo(it2.peek());
        if (comp < 0)
            onlyInList1.add(it1.next());
        else if (comp > 0)
            onlyInList2.add(it2.next());
        else /* comp == 0 */ {
            it1.next();
            it2.next();
        }
    }
    it1.forEachRemaining(onlyInList1::add);
    it2.forEachRemaining(onlyInList2::add);
}
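For the "write your own" option, a peeking iterator can be quite small. The sketch below is one possible way to do it, not the Guava or Apache Commons implementation; `SimplePeekingIterator` and `PeekingCompareSketch` are hypothetical names, and the sample String data is made up for illustration. It requires Java 8 for forEachRemaining and method references.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

public class PeekingCompareSketch {

    /** Hypothetical minimal peeking iterator wrapping any Iterator. */
    public static final class SimplePeekingIterator<T> implements Iterator<T> {
        private final Iterator<T> delegate;
        private T peeked;           // element fetched by peek() but not yet consumed
        private boolean hasPeeked;  // true while 'peeked' holds a pending element

        public SimplePeekingIterator(Iterator<T> delegate) {
            this.delegate = delegate;
        }

        @Override
        public boolean hasNext() {
            return hasPeeked || delegate.hasNext();
        }

        /** Returns the next element without consuming it. */
        public T peek() {
            if (!hasPeeked) {
                peeked = delegate.next(); // throws NoSuchElementException when exhausted
                hasPeeked = true;
            }
            return peeked;
        }

        @Override
        public T next() {
            if (!hasPeeked) {
                return delegate.next();
            }
            T result = peeked;
            peeked = null;
            hasPeeked = false;
            return result;
        }
    }

    /** Same merge-style comparison as above, using the hand-written iterator. */
    public static <T extends Comparable<? super T>> void compareSortedLists(
            Iterable<T> list1, Iterable<T> list2,
            Collection<T> onlyInList1, Collection<T> onlyInList2) {
        SimplePeekingIterator<T> it1 = new SimplePeekingIterator<>(list1.iterator());
        SimplePeekingIterator<T> it2 = new SimplePeekingIterator<>(list2.iterator());
        while (it1.hasNext() && it2.hasNext()) {
            int comp = it1.peek().compareTo(it2.peek());
            if (comp < 0) {
                onlyInList1.add(it1.next());
            } else if (comp > 0) {
                onlyInList2.add(it2.next());
            } else {
                it1.next(); // present in both lists, skip it on both sides
                it2.next();
            }
        }
        // At most one of the iterators still has elements left.
        it1.forEachRemaining(onlyInList1::add);
        it2.forEachRemaining(onlyInList2::add);
    }

    public static void main(String[] args) {
        List<String> sourceDbResults = Arrays.asList("a", "b", "d", "e");
        List<String> hiveResults = Arrays.asList("b", "c", "e", "f");
        List<String> onlyInSource = new ArrayList<>();
        List<String> onlyInHive = new ArrayList<>();
        compareSortedLists(sourceDbResults, hiveResults, onlyInSource, onlyInHive);
        boolean identical = onlyInSource.isEmpty() && onlyInHive.isEmpty();
        System.out.println("identical=" + identical
                + ", onlyInSource=" + onlyInSource + ", onlyInHive=" + onlyInHive);
    }
}
```

The lists are the same exactly when both output collections end up empty, which covers both needs from the question in a single pass.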


 