
What is the most effective algorithm to compare a large number of strings?

Let's take two ArrayLists of strings:

List<String> namesListA = new ArrayList<>(/*50 000 strings*/);
List<String> namesListB = new ArrayList<>(/*400 000 strings*/);

The removeAll method does not seem to work. After:

namesListA.removeAll(namesListB);

namesListA.size() is still 50000. Edit: The input data was incorrect. removeAll actually works, but it takes a long time.

I wrote the following brute-force code:

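List<String> finallist = new ArrayList<>();
boolean match;
for (String stringA : namesListA)
{
    match = false;
    // scan all of namesListB for every element of namesListA
    for (String stringB : namesListB)
    {
        if (stringA.equals(stringB))
        {
            match = true;
            break;
        }
    }
    if (!match)
    {
        finallist.add(stringA);
    }
}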

But it takes 8 hours to run. Is there any known efficient algorithm for searching strings? For example, sorting the strings in alphabetical order and then searching letter by letter, or something like that.

You could put the elements of namesListB into a new Set (preferably a HashSet). Calling namesListA.removeAll(setFromListB) is then much more efficient, because the implementation of ArrayList.removeAll calls Collection.contains(), which is much faster on a HashSet than on an ArrayList: HashSet.contains() runs in constant time, while ArrayList.contains() runs in linear time.

Anyway, namesListA.removeAll(namesListB) should work. If namesListA doesn't change, then the two lists have no elements in common.

Estimate of the time complexity (N = namesListA.size(), M = namesListB.size()):
Creating the HashSet from namesListB: O(M)
Calling namesListA.removeAll(setFromListB): O(N * 1) = O(N)
In total: O(M + N), which can be written as O(M) here since M > N.

Create a set for the 400 000 names in namesListB. Then use this set to remove the undesired elements of namesListA.

List<String> namesListA = new ArrayList<>(/*50 000 strings*/);
List<String> namesListB = new ArrayList<>(/*400 000 strings*/);

Set<String> undesiredNames = new HashSet<>(namesListB);

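// Calling namesListA.remove(name) inside a for-each loop over namesListA
// throws ConcurrentModificationException; removeIf (Java 8+) removes the
// matching elements safely in a single pass.
namesListA.removeIf(undesiredNames::contains);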

One possibility would be to parallelize the removal. The lists namesListA and namesListB can be grouped by starting character; the removal could then be done group-wise in parallel, and the resulting lists could be concatenated again.

Assuming a standard Latin alphabet, this would result in roughly 26 groups that could be processed in parallel. If 4 threads can run in parallel, I would expect a significant speedup.
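One way the group-wise idea could look in Java is sketched below. It is only an illustration under assumptions: the names ParallelGroupRemoval, removeAllGrouped and groupKey are made up for this example, groups are keyed by the first character, and each group of the second list is stored in a HashSet, which borrows the set-based idea from the other answers rather than being part of this suggestion.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

class ParallelGroupRemoval {

    // Group key (hypothetical helper): the first character of the string.
    // Empty strings all share the key '\0'.
    static Character groupKey(String s) {
        return s.isEmpty() ? '\0' : s.charAt(0);
    }

    // Returns the elements of listA that do not occur in listB.
    // The result is ordered group by group, not in the original order of listA.
    static List<String> removeAllGrouped(List<String> listA, List<String> listB) {
        Map<Character, List<String>> groupsA = listA.stream()
                .collect(Collectors.groupingBy(ParallelGroupRemoval::groupKey));
        Map<Character, Set<String>> groupsB = listB.stream()
                .collect(Collectors.groupingBy(ParallelGroupRemoval::groupKey,
                        Collectors.toSet()));

        // Each group of listA is filtered against the matching group of listB;
        // the groups are processed in parallel and concatenated by flatMap.
        return groupsA.entrySet().parallelStream()
                .flatMap(entry -> {
                    Set<String> unwanted = groupsB.getOrDefault(
                            entry.getKey(), Collections.<String>emptySet());
                    return entry.getValue().stream()
                            .filter(name -> !unwanted.contains(name));
                })
                .collect(Collectors.toList());
    }
}
```

Whether this actually beats a single-threaded HashSet lookup would have to be measured: for lists of this size the grouping step and the thread coordination may cost more than they save.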

I would recommend using a HashSet instead of a List to store the Strings of the larger collection, so that checking whether it contains a given String takes O(1) instead of O(n). Then use removeAll(Collection<?> c) to keep only the Strings that are not in the second collection, as follows:

List<String> namesListA = new ArrayList<>(/*50 000 strings*/);
Set<String> namesSetB = new HashSet<>(/*400 000 strings*/);
namesListA.removeAll(namesSetB);

Here's a solution in O(n*logn). It should be faster than the approaches posted so far. Edit: If you don't need the exact element, my other approach is faster.

1.) Sort both lists

Use Collections.sort(...) for efficient sorting in O(n*logn).

2.) Compare with two iterators

Fetch two iterators over the two lists. Then:

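// Assumes both lists are sorted and non-empty.
Iterator<String> leftIterator = namesListA.iterator();
Iterator<String> rightIterator = namesListB.iterator();
String leftElement = leftIterator.next();
String rightElement = rightIterator.next();
while (true) {
    // compareTo may return any negative or positive int, not just -1 or 1
    int comparisonResult = leftElement.compareTo(rightElement);
    if (comparisonResult < 0) {
        // left is smaller: advance left, or stop if it is exhausted
        if (!leftIterator.hasNext()) return false;
        leftElement = leftIterator.next();
    }
    else if (comparisonResult > 0) {
        // right is smaller: advance right, or stop if it is exhausted
        if (!rightIterator.hasNext()) return false;
        rightElement = rightIterator.next();
    }
    else {
        // found it!
        return true;
    }
}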

(Sorry if I mistyped something, I don't have an IDE at hand.)

=> Sorting is in O(i logi + j logj)

=> Comparison is in O(i+j)


The overall performance is effectively in O(n*logn). This should work nicely.
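The loop above only reports whether any common element exists. For the question's actual goal (the elements of one list that are missing from the other), the same sort-and-merge idea could be adapted roughly as follows. This is a sketch, not the answer author's code: the names SortedDifference and difference are invented for illustration, it uses index variables instead of iterators, and it sorts copies so the input lists stay untouched.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SortedDifference {

    // Returns the elements of listA that do not occur in listB, in sorted order.
    static List<String> difference(List<String> listA, List<String> listB) {
        List<String> a = new ArrayList<>(listA);
        List<String> b = new ArrayList<>(listB);
        Collections.sort(a); // O(i log i)
        Collections.sort(b); // O(j log j)

        List<String> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        // Single merge pass over both sorted lists: O(i + j)
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp < 0) {
                // a[i] is smaller than everything left in b, so it has no match
                result.add(a.get(i));
                i++;
            } else if (cmp > 0) {
                j++;
            } else {
                // match: drop a[i]; j stays so repeated values in a are dropped too
                i++;
            }
        }
        // Whatever remains in a is larger than every element of b
        while (i < a.size()) {
            result.add(a.get(i));
            i++;
        }
        return result;
    }
}
```

Elements that occur several times in listA and never in listB are kept as many times as they occur, which matches what removeAll would do.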

Considering that your lists have 50k and 400k elements, calling removeAll on the list is probably the better solution:

namesListA.removeAll(namesListB);

If it doesn't matter which element is a duplicate, only whether there is one at all, you can let the Collections classes do the trick for you.

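int sizeA = listA.size();
int sizeB = listB.size();

// someLoadFactor is a placeholder for a sizing factor of your choice
Set<String> merger = new HashSet<>((sizeA + sizeB) * someLoadFactor);
merger.addAll(listA);
merger.addAll(listB);
// Sets do not contain duplicates!
// (This check assumes that neither list contains duplicates on its own.)

return merger.size() < sizeA + sizeB;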

This runs in O(i+j), so effectively O(n)!
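The size comparison above reports a duplicate even when the repeated string occurs twice inside a single list. If that matters, a variant built on the standard Collections.disjoint method avoids the problem. The sketch below is an alternative, not the answer's code; the names CommonElementCheck and haveCommonElement are made up for the example.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CommonElementCheck {

    // Returns true if at least one string occurs in both lists.
    static boolean haveCommonElement(List<String> listA, List<String> listB) {
        boolean aIsLarger = listA.size() >= listB.size();
        // Hash the larger list once so that each lookup is expected O(1)
        Set<String> larger = new HashSet<>(aIsLarger ? listA : listB);
        List<String> smaller = aIsLarger ? listB : listA;
        // disjoint returns true when the two collections share no element
        return !Collections.disjoint(smaller, larger);
    }
}
```

With the lists from the question this would be called as haveCommonElement(namesListA, namesListB). It still takes expected O(i+j) time overall.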

