Time complexity difference between two containsDuplicates algorithms

I completed two versions of a LeetCode algorithm and am wondering whether my complexity analysis is correct, even though the online submission times in ms don't show it accurately. The goal is to take a vector of numbers by reference and return true if it contains duplicate values and false if it does not.

The two most intuitive approaches are:

1.) Sort the vector, sweep from the first element to the second-to-last, and return true if any neighboring elements are identical.

2.) Use a hash table, inserting the values one by one, and return true if a key already exists in the table.

I completed the first version first, and it was quick, but since the sort routine takes O(nlog(n)) while the hash table inserts and map.count()s would make the second version O(log(n) + N) = O(N), I figured the hashing version would be faster with very large data sets.

In the online judging I was proven wrong; however, I assumed they weren't using large enough data sets to offset the std::map overhead. So I ran a lot of tests, repeatedly filling vectors with sizes between 0 and 10000 (incrementing by 2) and random values between 0 and 20000. I piped the output to a CSV file and plotted it on Linux; here's the image I got.

[Plot: runtime of both algorithms vs. vector size]

Is the provided image truly showing the difference here between an O(N) and an O(nlog(n)) algorithm? I just want to make sure my complexity analysis is correct on these.

Here are the algorithms run:

#include <algorithm>
#include <map>
#include <vector>
using namespace std;

// Sort, then scan adjacent pairs: O(n log n) for the sort plus O(n) for the scan.
bool containsDuplicate(vector<int>& nums) {
  if (nums.size() < 2) return false;
  sort(nums.begin(), nums.end());

  for (size_t i = 0; i + 1 < nums.size(); ++i) {
    if (nums[i] == nums[i+1]) return true;
  }
  return false;
}

// Slightly slower in small cases because of data structure overhead I presume
bool containsDuplicateWithHashing(vector<int>& nums) {
  map<int, int> seen;                        // maps each value to the index where it was first seen
  for (size_t i = 0; i < nums.size(); ++i) {
    if (seen.count(nums[i])) return true;    // key already present -> duplicate
    seen.insert({nums[i], static_cast<int>(i)});
  }
  return false;
}
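For reference, here is a rough sketch of the benchmark harness described above, under my assumptions about the details (it calls the two functions shown here; the CSV column layout and timing units are illustrative):

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>
using namespace std;

// Time both functions over vector sizes 0..10000 (step 2), filled with
// random values in [0, 20000), and print one CSV row per size.
int main() {
  for (int n = 0; n <= 10000; n += 2) {
    vector<int> nums(n);
    for (int& v : nums) v = rand() % 20000;

    vector<int> copy = nums;                       // sorting mutates its argument
    auto t0 = chrono::steady_clock::now();
    containsDuplicate(copy);
    auto t1 = chrono::steady_clock::now();
    containsDuplicateWithHashing(nums);
    auto t2 = chrono::steady_clock::now();

    cout << n << ','
         << chrono::duration<double, micro>(t1 - t0).count() << ','
         << chrono::duration<double, micro>(t2 - t1).count() << '\n';
  }
}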

std::map is sorted, and involves O(log n) cost for each insertion and lookup, so the total cost in the "no duplicates" case (or in the "first duplicate near the end of the vector" case) would have similar big-O to sorting and scanning: O(n log n); it's typically fragmented in memory, so overhead could easily be higher than that of an optimized std::sort.

It would appear much faster if duplicates were common though; if you usually find a duplicate in the first 10 elements, it doesn't matter if the input has 10,000 elements, because the map doesn't have time to grow before you hit a duplicate and duck out. It's just that a test that only works well when it succeeds is not a very good test for general usage (if duplicates are that common, the test seems a bit silly); you want good performance in both the contains-duplicate and doesn't-contain-duplicate cases.

If you're looking to compare approaches with meaningfully different algorithmic complexity, try using std::unordered_set to replace your map-based solution (insert returns whether the key already existed as well, so you reduce the work from one lookup followed by one insert to just one combined insert-and-lookup on each loop), which has average case O(1) insertion and lookup, for O(n) duplicate checking complexity.
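A minimal sketch of that std::unordered_set version (the function name is illustrative, not from the original):

#include <unordered_set>
#include <vector>
using namespace std;

// insert() returns pair<iterator, bool>; .second is false when the value
// was already present, so each element costs one combined lookup-and-insert.
bool containsDuplicateWithHashSet(const vector<int>& nums) {
  unordered_set<int> seen;
  seen.reserve(nums.size());                  // avoid rehashing as the set grows
  for (int v : nums) {
    if (!seen.insert(v).second) return true;  // already present -> duplicate
  }
  return false;
}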

FYI, another approach that would be O(n log n) but uses a sort-like strategy that shortcuts when a duplicate is found early would be to make a heap with std::make_heap (O(n) work), then repeatedly pop_heap (O(log n) per pop) from the heap and compare to the heap's .front(); if the value you just popped and the front are the same, you've got a duplicate and can exit immediately. You could also use the priority_queue adapter to simplify this into a single container, instead of manually using the utility functions on a std::vector or the like.
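A minimal sketch of that heap-based early exit (again, the function name is illustrative):

#include <algorithm>
#include <vector>
using namespace std;

// Build a max-heap in O(n); each pop_heap moves the current max to just past
// the shrinking heap range, so a duplicate shows up as popped == front().
bool containsDuplicateWithHeap(vector<int> nums) {   // taken by value: we reorder it
  if (nums.size() < 2) return false;
  make_heap(nums.begin(), nums.end());
  for (auto last = nums.end(); last - nums.begin() > 1; --last) {
    pop_heap(nums.begin(), last);                    // old max now sits at *(last - 1)
    if (*(last - 1) == nums.front()) return true;    // equals the new max -> duplicate
  }
  return false;
}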
