
Time complexity difference between two containsDuplicates algorithms

I completed two versions of a LeetCode algorithm and am wondering if my complexity analysis is correct, even though the online submission times in ms do not show it clearly. The goal is to take a vector of numbers by reference and return true if it contains duplicate values and false if it does not.

The two most intuitive approaches are:

1.) Sort the vector, then sweep once from the first element to the second-to-last, and return true if any pair of neighboring elements is identical.

2.) Use a hash table: insert the values one at a time, and return true if a key already exists in the table.

I completed the first version first, and it was quick, but since the sort routine takes O(n log(n)) while the hash-table inserts and map.count() calls would make the second version O(log(n) + N) = O(N), I would think the hashing version would be faster with very large data sets.

In the online judging I was proven wrong; however, I assumed they weren't using large enough data sets to offset the std::map overhead. So I ran a lot of tests myself, repeatedly filling vectors of sizes from 0 up to 10000 (incrementing by 2) with random values between 0 and 20000, and timing both functions. I piped the output to a CSV file, plotted it on Linux, and here's the image I got.

(plot of the timing results for both algorithms)

Is the provided image truly showing the difference here between an O(N) and an O(n log(n)) algorithm? I just want to make sure my complexity analysis of these is correct.
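Roughly, the test driver looked like this (a minimal sketch for illustration; the seed, output columns, and timing calls here are placeholders rather than the exact code):

    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <vector>

    using namespace std;
    using namespace std::chrono;

    bool containsDuplicate(vector<int>& nums);             // defined below
    bool containsDuplicateWithHashing(vector<int>& nums);  // defined below

    // Times both versions on random vectors of growing size and prints
    // CSV rows (size, sort_ms, hash_ms) suitable for plotting.
    int main() {
        mt19937 gen(12345);                                 // fixed seed, illustrative
        uniform_int_distribution<int> dist(0, 20000);

        printf("size,sort_ms,hash_ms\n");
        for (int n = 0; n <= 10000; n += 2) {
            vector<int> base(n);
            for (int& v : base) v = dist(gen);

            vector<int> a = base, b = base;                 // separate copies; sorting mutates

            auto t0 = high_resolution_clock::now();
            containsDuplicate(a);
            auto t1 = high_resolution_clock::now();
            containsDuplicateWithHashing(b);
            auto t2 = high_resolution_clock::now();

            printf("%d,%.4f,%.4f\n", n,
                   duration<double, milli>(t1 - t0).count(),
                   duration<double, milli>(t2 - t1).count());
        }
    }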

Here are the algorithms run:

#include <algorithm>
#include <map>
#include <vector>
using namespace std;

// Version 1: sort, then scan adjacent pairs for a match.
bool containsDuplicate(vector<int>& nums) {
  if(nums.size() < 2) return false;
  sort(nums.begin(), nums.end());

  for(int i = 0; i < nums.size()-1; ++i) {
    if(nums[i] == nums[i+1]) return true;
  }
  return false;
}

// Version 2: record each value in a map, bail out on the first repeat.
// Slightly slower in small cases because of data structure overhead, I presume.
bool containsDuplicateWithHashing(vector<int>& nums) {
  map<int, int> seen;
  for (int i = 0; i < nums.size(); ++i) {
    if(seen.count(nums[i])) return true;
    seen.insert({nums[i], i});
  }
  return false;
}

std::map is sorted, and involves O(log n) cost for each insertion and lookup, so the total cost in the "no duplicates" case (or in the "first duplicate near the end of the vector" case) would have a similar big-O to sorting and scanning: O(n log n); it's also typically fragmented in memory, so its overhead could easily be higher than that of an optimized std::sort.

It would appear much faster if duplicates were common though; if you usually find a duplicate in the first 10 elements, it doesn't matter if the input has 10,000 elements, because the map doesn't have time to grow before you hit a duplicate and duck out. It's just that a test that only works well when it succeeds is not a very good test for general usage (if duplicates are that common, the test seems a bit silly); you want good performance in both the "contains a duplicate" and "doesn't contain a duplicate" cases.

If you're looking to compare approaches with meaningfully different algorithmic complexity, try using std::unordered_set to replace your map-based solution (insert also tells you whether the key already existed, via the bool in the pair it returns, so you reduce the work from one lookup followed by one insert to a single combined insert-and-lookup per loop iteration), which has average-case O(1) insertion and lookup, for O(n) duplicate-checking complexity.
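For illustration, a minimal sketch of that unordered_set-based version (the function name here is made up, not from the question):

    #include <unordered_set>
    #include <vector>
    using namespace std;

    // Average O(1) insert/lookup per element, so O(n) overall.
    // insert() returns {iterator, bool}; the bool is false when the value was
    // already present, which is exactly the "found a duplicate" signal.
    bool containsDuplicateWithHashSet(vector<int>& nums) {
        unordered_set<int> seen;
        seen.reserve(nums.size());   // pre-size the buckets to avoid rehashing
        for (int v : nums) {
            if (!seen.insert(v).second) return true;
        }
        return false;
    }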

FYI, another approach that would be O(n log n), but uses a sort-like strategy that shortcuts when a duplicate is found early, would be to make a heap with std::make_heap (O(n) work), then repeatedly pop_heap (O(log n) per pop) from the heap and compare the popped value to the heap's .front(); if the value you just popped and the front are the same, you've got a duplicate and can exit immediately. You could also use the priority_queue adapter to simplify this into a single container, instead of manually using the utility functions on a std::vector or the like.
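A rough sketch of that heap-based early-exit idea, using the free functions on a vector rather than priority_queue (illustrative only):

    #include <algorithm>
    #include <vector>
    using namespace std;

    // make_heap is O(n); each pop_heap is O(log n); exits as soon as the value
    // just popped equals the new maximum still on the heap.
    bool containsDuplicateWithHeap(vector<int> nums) {      // by value: reorders a copy
        if (nums.size() < 2) return false;
        make_heap(nums.begin(), nums.end());                // max-heap over the values
        for (auto last = nums.end(); last - nums.begin() > 1; --last) {
            pop_heap(nums.begin(), last);                   // largest value moves to *(last - 1)
            if (*(last - 1) == nums.front()) return true;   // equals the new front => duplicate
        }
        return false;
    }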
