
Deciding whether two integer lists contain common elements or not (C++)

I have a running-time issue with my C++ program. The program checks, millions of times, whether two integer lists contain any common element. I don't need to know which elements are common. I wrote the method below, but it doesn't look efficient, and I need to speed the program up. So, what is the best way of doing this, or does C++ have any built-in method which does this comparison efficiently?

bool compareHSAndNewSet(list<int> hs, list<int> newset){
    bool isCommon = false;
    for(int x : hs){
        for(int y : newset){
            if(x == y){isCommon = true; break;}
        }
        if(isCommon == true) {break;}
    }
    return isCommon;
}

Hint: I don't know whether this matters, but the first input of the function (hs in the code) is ordered.

I was curious about the various strategies, so I made the simple benchmark below. However, I wouldn't try to sort the second container; comparing all the data inside a container and moving them around seems to be overkill just to find one element in the intersection.

The program gives these results on my computer (Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz):

vectors: 1.41164
vectors (dichotomic): 0.0187354
lists: 12.0402
lists (dichotomic): 13.4844

If we ignore the fact that the first container is sorted and iterate its elements in order, we can see that a simpler container (a vector here) with adjacent storage of the elements is much better than elements spread across memory (a list here): 1.41164 s versus 12.0402 s (an 8.5× speedup).

But if we take into account that the first container is sorted (as stated in the question), a dichotomic approach can improve the situation even more.

The best case (dichotomic approach on vectors) is far better than the original case (in-order approach on lists): 0.0187354 s versus 12.0402 s (a 642× speedup).

Of course, all of this depends on many other factors (sizes of datasets, distributions of the values...); this is just a micro benchmark, and a specific application could behave differently.

Note that in the question the parameters were passed by value; this will probably cause some unneeded copies (unless a move operation is used at the call site, but I would find that uncommon for such a function). I switched to pass-by-const-reference instead.
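For illustration, here is what that change looks like on the original function's signature (a minimal sketch; the body stays the same):

bool compareHSAndNewSet(const std::list<int> &hs, const std::list<int> &newset);
// same body as before, but the lists are no longer copied at each call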

Note also that a dichotomic approach on a list is a pessimisation (the iterators provide no random access, so the search is still linear but more complicated than the simplest linear approach).

Edit: my original code was wrong; thanks to @bitmask I changed it. It does not change the general idea.

/**
  g++ -std=c++17 -o prog_cpp prog_cpp.cpp \
      -pedantic -Wall -Wextra -Wconversion -Wno-sign-conversion \
      -O3 -DNDEBUG -march=native
**/

#include <list>
#include <vector>
#include <algorithm>
#include <chrono>
#include <random>
#include <tuple>
#include <iostream>

template<typename Container>
bool
compareHSAndNewSet(const Container &hs,
                   const Container &newset)
{
  for(const auto &elem: newset)
  {
    const auto it=std::find(cbegin(hs), cend(hs), elem);
    if(it!=cend(hs))
    {
      return true; // found common element
    }
  }
  return false; // no common element
}

template<typename Container>
bool
compareHSAndNewSet_dichotomic(const Container &hs,
                              const Container &newset)
{
  for(const auto &elem: newset)
  {
    if(std::binary_search(cbegin(hs), cend(hs), elem))
    {
      return true; // found common element
    }
  }
  return false; // no common element
}

std::tuple<std::vector<int>, // hs
           std::vector<int>> // newset
prepare_vectors()
{
  static auto rnd_gen=std::default_random_engine {std::random_device{}()};
  constexpr auto sz=10'000;
  auto distr=std::uniform_int_distribution<int>{0, 10*sz};
  auto hs=std::vector<int>{};
  auto newset=std::vector<int>{};
  for(auto i=0; i<sz; ++i)
  {
    hs.emplace_back(distr(rnd_gen));
    newset.emplace_back(distr(rnd_gen));
  }
  std::sort(begin(hs), end(hs));
  return {hs, newset};
}

std::tuple<std::list<int>, // hs
           std::list<int>> // newset
prepare_lists(const std::vector<int> &hs,
              const std::vector<int> &newset)
{
  return {std::list(cbegin(hs), cend(hs)),
          std::list(cbegin(newset), cend(newset))};
}

double // seconds (1e-6 precision) since 1970/01/01 00:00:00 UTC
get_time()
{
  const auto now=std::chrono::system_clock::now().time_since_epoch();
  const auto us=std::chrono::duration_cast<std::chrono::microseconds>(now);
  return 1e-6*double(us.count());
}

int
main()
{
  constexpr auto generations=100;
  constexpr auto iterations=1'000;
  auto duration_v=0.0;
  auto duration_vd=0.0;
  auto duration_l=0.0;
  auto duration_ld=0.0;
  for(auto g=0; g<generations; ++g)
  {
    const auto [hs_v, newset_v]=prepare_vectors();
    const auto [hs_l, newset_l]=prepare_lists(hs_v, newset_v);
    for(auto i=-1; i<iterations; ++i)
    {
      const auto t0=get_time();
      const auto comp_v=compareHSAndNewSet(hs_v, newset_v);
      const auto t1=get_time();
      const auto comp_vd=compareHSAndNewSet_dichotomic(hs_v, newset_v);
      const auto t2=get_time();
      const auto comp_l=compareHSAndNewSet(hs_l, newset_l);
      const auto t3=get_time();
      const auto comp_ld=compareHSAndNewSet_dichotomic(hs_l, newset_l);
      const auto t4=get_time();
      if((comp_v!=comp_vd)||(comp_v!=comp_l)||(comp_v!=comp_ld))
      {
        std::cerr << "comparison mismatch\n";
      }
      if(i>=0) // first iteration is dry-run (warmup)
      {
        duration_v+=t1-t0;
        duration_vd+=t2-t1;
        duration_l+=t3-t2;
        duration_ld+=t4-t3;
      }
    }
  }
  std::cout << "vectors: " << duration_v << '\n';
  std::cout << "vectors (dichotomic): " << duration_vd << '\n';
  std::cout << "lists: " << duration_l << '\n';
  std::cout << "lists (dichotomic): " << duration_ld << '\n';
  return 0;
}

You can try sorting the lists and using std::set_intersection.

bool compareHSAndNewSet(list<int> hs, list<int> newset){
    hs.sort();
    newset.sort();
    list<int> commonElts;
    // needs <algorithm> and <iterator>
    std::set_intersection(hs.begin(), hs.end(), newset.begin(), newset.end(),
                          std::back_inserter(commonElts));
    return !commonElts.empty();
}

I'd use std::unordered_map<> to add the first list to, then check each element of the second list to see if it exists in the map. This ends up iterating each list once, doing length(first) insertions and length(second) lookups on the map.

std::unordered_map<> should have average lookup and insertion complexity of O(1), though the worst case can degrade to O(n), I believe.
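A minimal sketch of this idea, assuming a hash-based container is acceptable (I use std::unordered_set here rather than std::unordered_map, since only the keys matter for a membership test; the function name is just a placeholder to distinguish it from the earlier versions):

#include <list>
#include <unordered_set>

bool compareHSAndNewSetHashed(const std::list<int> &hs, const std::list<int> &newset)
{
    // length(first) insertions into the hash set
    std::unordered_set<int> lookup(hs.cbegin(), hs.cend());
    // length(second) lookups
    for(int x : newset)
    {
        if(lookup.count(x) != 0)
        {
            return true; // common element found
        }
    }
    return false; // no common element
}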
