
How to remove duplicates from unsorted std::vector while keeping the original ordering using algorithms?

I have an array of integers that I need to remove duplicates from while maintaining the order of the first occurrence of each integer. I can see doing it like this, but I imagine there is a better way that makes better use of STL algorithms. The insertion is out of my control, so I cannot check for duplicates before inserting.

int unsortedRemoveDuplicates(std::vector<int> &numbers) {
    std::set<int> uniqueNumbers;
    std::vector<int>::iterator allItr = numbers.begin();
    std::vector<int>::iterator unique = allItr;
    std::vector<int>::iterator endItr = numbers.end();

    for (; allItr != endItr; ++allItr) {
        const bool isUnique = uniqueNumbers.insert(*allItr).second;

        if (isUnique) {
            *unique = *allItr;
            ++unique;
        }
    }

    const int duplicates = endItr - unique;

    numbers.erase(unique, endItr);
    return duplicates;
}

How can this be done using STL algorithms?

Sounds like a job for std::copy_if. Define a predicate that keeps track of the elements that have already been processed and returns false for any element it has seen before.

If you don't have C++11 support, you can use the clumsily named std::remove_copy_if and invert the logic.

This is an untested example:

template <typename T>
struct NotDuplicate {
  bool operator()(const T& element) {
    return s_.insert(element).second; // .second is true only if the element was newly inserted
  }
 private:
  std::set<T> s_;
};

Then

std::vector<int> uniqueNumbers;
NotDuplicate<int> pred;
std::copy_if(numbers.begin(), numbers.end(), 
             std::back_inserter(uniqueNumbers),
             std::ref(pred));

where std::ref is used to avoid potential problems with the algorithm internally copying what is a stateful functor, although std::copy_if does not place any requirements on side effects of the functor being applied.
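To make the two variants concrete, here is a minimal sketch that wraps both in functions. The names `uniqueCopy`, `uniqueRemoveCopy` and `IsDuplicate` are made up for illustration and do not come from the answer. Note that std::ref is itself a C++11 facility, so the std::remove_copy_if version below only shows the inverted logic; a true C++03 build would have to pass the functor by value and rely on the implementation not copying it mid-run, which is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <iterator>
#include <set>
#include <vector>

// Stateful predicate: true the first time a value is seen.
template <typename T>
struct NotDuplicate {
  bool operator()(const T& element) {
    return seen_.insert(element).second;
  }
 private:
  std::set<T> seen_;
};

// Hypothetical inverse predicate for std::remove_copy_if, which skips
// the elements for which the predicate returns true.
template <typename T>
struct IsDuplicate {
  bool operator()(const T& element) {
    return !seen_.insert(element).second;
  }
 private:
  std::set<T> seen_;
};

// Hypothetical wrapper: copies only the first occurrence of each value.
inline std::vector<int> uniqueCopy(const std::vector<int>& numbers) {
  std::vector<int> result;
  NotDuplicate<int> pred;
  std::copy_if(numbers.begin(), numbers.end(),
               std::back_inserter(result), std::ref(pred));
  return result;
}

// Hypothetical wrapper: same result with the logic inverted.
inline std::vector<int> uniqueRemoveCopy(const std::vector<int>& numbers) {
  std::vector<int> result;
  IsDuplicate<int> pred;
  std::remove_copy_if(numbers.begin(), numbers.end(),
                      std::back_inserter(result), std::ref(pred));
  return result;
}
```

Both functions leave the input untouched and return a new vector, so the caller decides whether to assign the result back to the original.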

The naive way is to use std::set as everyone tells you. It's overkill and has poor cache locality (slow).
The smart* way is to use std::vector appropriately (make sure to see footnote at bottom):

#include <algorithm>
#include <vector>
struct target_less
{
    template<class It>
    bool operator()(It const &a, It const &b) const { return *a < *b; }
};
struct target_equal
{
    template<class It>
    bool operator()(It const &a, It const &b) const { return *a == *b; }
};
template<class It> It uniquify(It begin, It const end)
{
    std::vector<It> v;
    v.reserve(static_cast<size_t>(std::distance(begin, end)));
    for (It i = begin; i != end; ++i)
    { v.push_back(i); }
    std::sort(v.begin(), v.end(), target_less());
    v.erase(std::unique(v.begin(), v.end(), target_equal()), v.end());
    std::sort(v.begin(), v.end());
    size_t j = 0;
    for (It i = begin; i != end && j != v.size(); ++i)
    {
        if (i == v[j])
        {
            using std::iter_swap; iter_swap(i, begin);
            ++j;
            ++begin;
        }
    }
    return begin;
}

Then you can use it like:

int main()
{
    std::vector<int> v;
    v.push_back(6);
    v.push_back(5);
    v.push_back(5);
    v.push_back(8);
    v.push_back(5);
    v.push_back(8);
    v.erase(uniquify(v.begin(), v.end()), v.end());
}

*Note: That's the smart way in typical cases, where the number of duplicates isn't too high. For a more thorough performance analysis, see this related answer to a related question.
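For readers who want a set-based baseline to compare against, the sketch below swaps std::set for std::unordered_set. This is a different technique from the sort-based code in this answer, and it assumes the element type is hashable. The name `removeDuplicatesHashed` is made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not from the answer): removes later duplicates in
// place, keeps first occurrences in order, returns how many were removed.
template <typename T>
std::size_t removeDuplicatesHashed(std::vector<T>& vec) {
  std::unordered_set<T> seen;
  seen.reserve(vec.size());  // avoid rehashing while the set grows
  const std::size_t before = vec.size();
  // The lambda captures the set by reference, so copies of the predicate
  // made inside the algorithm still share one set.
  auto newEnd = std::remove_if(vec.begin(), vec.end(),
      [&seen](const T& value) { return !seen.insert(value).second; });
  vec.erase(newEnd, vec.end());
  return before - vec.size();
}
```

This relies on std::remove_if visiting the elements front to back, which mainstream implementations do. Whether it beats the sort-based version depends on the data and would need measuring.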

Fast and simple, C++11:

template<typename T>
size_t RemoveDuplicatesKeepOrder(std::vector<T>& vec)
{
    std::set<T> seen;

    auto newEnd = std::remove_if(vec.begin(), vec.end(), [&seen](const T& value)
    {
        if (seen.find(value) != std::end(seen))
            return true;

        seen.insert(value);
        return false;
    });

    vec.erase(newEnd, vec.end());

    return vec.size();
}

int unsortedRemoveDuplicates(std::vector<int>& numbers)
{
    std::set<int> seenNums; //log(n) existence check

    auto itr = begin(numbers);
    while(itr != end(numbers))
    {
        if(seenNums.find(*itr) != end(seenNums)) //seen? erase it
            itr = numbers.erase(itr); //itr now points to next element
        else
        {
            seenNums.insert(*itr);
            itr++;
        }
    }

    return seenNums.size();
}


//3 6 3 8 9 5 6 8
//3 6 8 9 5

To verify the performance of the proposed solutions, I've tested three of them, listed below. The tests use random vectors with 1 million elements and different ratios of duplicates (0%, 1%, 2%, ..., 10%, ..., 90%, 100%).

  • Mehrdad's solution, currently the accepted answer:

    void uniquifyWithOrder_sort(const vector<int>&, vector<int>& output)
    {
        using It = vector<int>::iterator;
        struct target_less
        {
            bool operator()(It const &a, It const &b) const { return *a < *b; }
        };
        struct target_equal
        {
            bool operator()(It const &a, It const &b) const { return *a == *b; }
        };

        auto begin = output.begin();
        auto const end = output.end();
        {
            vector<It> v;
            v.reserve(static_cast<size_t>(distance(begin, end)));
            for (auto i = begin; i != end; ++i)
            { v.push_back(i); }
            sort(v.begin(), v.end(), target_less());
            v.erase(unique(v.begin(), v.end(), target_equal()), v.end());
            sort(v.begin(), v.end());
            size_t j = 0;
            for (auto i = begin; i != end && j != v.size(); ++i)
            {
                if (i == v[j])
                {
                    using std::iter_swap; iter_swap(i, begin);
                    ++j;
                    ++begin;
                }
            }
        }
        output.erase(begin, output.end());
    }
  • juanchopanza's solution:

    void uniquifyWithOrder_set_copy_if(const vector<int>& input, vector<int>& output)
    {
        struct NotADuplicate
        {
            bool operator()(const int& element) { return _s.insert(element).second; }
        private:
            set<int> _s;
        };
        vector<int> uniqueNumbers;
        NotADuplicate pred;
        output.clear();
        output.reserve(input.size());
        copy_if(
            input.begin(),
            input.end(),
            back_inserter(output),
            ref(pred));
    }
  • Leviathan's solution:

    void uniquifyWithOrder_set_remove_if(const vector<int>& input, vector<int>& output)
    {
        set<int> seen;
        auto newEnd = remove_if(output.begin(), output.end(), [&seen](const int& value)
        {
            if (seen.find(value) != end(seen))
                return true;

            seen.insert(value);
            return false;
        });
        output.erase(newEnd, output.end());
    }

They are slightly modified for simplicity, and to allow comparing in-place solutions with those that are not in-place. The full code used to test is available here.

The results suggest that if you know you'll have at least 1% duplicates, the remove_if solution with std::set is the best one. Otherwise, you should go with the sort solution:

// Intel(R) Core(TM) i7-2600 CPU @ 3.40 GHz 3.40 GHz
// 16 GB RAM, Windows 7, 64 bit
//
// cl 19
// /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /WX- /Zc:forScope /Gd /Oi /MD /EHsc /nologo /Ot 
//
// 1000 random vectors with 1 000 000 elements each.
// 11 tests: with 0%, 10%, 20%, ..., 90%, 100% duplicates in vectors.

// Ratio: 0
// set_copy_if   : Time : 618.162 ms +- 18.7261 ms
// set_remove_if : Time : 650.453 ms +- 10.0107 ms
// sort          : Time : 212.366 ms +- 5.27977 ms
// Ratio : 0.1
// set_copy_if   : Time : 34.1907 ms +- 1.51335 ms
// set_remove_if : Time : 24.2709 ms +- 0.517165 ms
// sort          : Time : 43.735 ms +- 1.44966 ms
// Ratio : 0.2
// set_copy_if   : Time : 29.5399 ms +- 1.32403 ms
// set_remove_if : Time : 20.4138 ms +- 0.759438 ms
// sort          : Time : 36.4204 ms +- 1.60568 ms
// Ratio : 0.3
// set_copy_if   : Time : 32.0227 ms +- 1.25661 ms
// set_remove_if : Time : 22.3386 ms +- 0.950855 ms
// sort          : Time : 38.1551 ms +- 1.12852 ms
// Ratio : 0.4
// set_copy_if   : Time : 30.2714 ms +- 1.28494 ms
// set_remove_if : Time : 20.8338 ms +- 1.06292 ms
// sort          : Time : 35.282 ms +- 2.12884 ms
// Ratio : 0.5
// set_copy_if   : Time : 24.3247 ms +- 1.21664 ms
// set_remove_if : Time : 16.1621 ms +- 1.27802 ms
// sort          : Time : 27.3166 ms +- 2.12964 ms
// Ratio : 0.6
// set_copy_if   : Time : 27.3268 ms +- 1.06058 ms
// set_remove_if : Time : 18.4379 ms +- 1.1438 ms
// sort          : Time : 30.6846 ms +- 2.52412 ms
// Ratio : 0.7
// set_copy_if   : Time : 30.3871 ms +- 0.887492 ms
// set_remove_if : Time : 20.6315 ms +- 0.899802 ms
// sort          : Time : 33.7643 ms +- 2.2336 ms
// Ratio : 0.8
// set_copy_if   : Time : 33.3077 ms +- 0.746272 ms
// set_remove_if : Time : 22.9459 ms +- 0.921515 ms
// sort          : Time : 37.119 ms +- 2.20924 ms
// Ratio : 0.9
// set_copy_if   : Time : 36.0888 ms +- 0.763978 ms
// set_remove_if : Time : 24.7002 ms +- 0.465711 ms
// sort          : Time : 40.8233 ms +- 2.59826 ms
// Ratio : 1
// set_copy_if   : Time : 21.5609 ms +- 1.48986 ms
// set_remove_if : Time : 14.2934 ms +- 0.535431 ms
// sort          : Time : 24.2485 ms +- 0.710269 ms

// Ratio: 0
// set_copy_if   : Time: 666.962 ms +- 23.7445 ms
// set_remove_if : Time: 736.088 ms +- 39.8122 ms
// sort          : Time: 223.796 ms +- 5.27345 ms
// Ratio: 0.01
// set_copy_if   : Time: 60.4075 ms +- 3.4673 ms
// set_remove_if : Time: 43.3095 ms +- 1.31252 ms
// sort          : Time: 70.7511 ms +- 2.27826 ms
// Ratio: 0.02
// set_copy_if   : Time: 50.2605 ms +- 2.70371 ms
// set_remove_if : Time: 36.2877 ms +- 1.14266 ms
// sort          : Time: 62.9786 ms +- 2.69163 ms
// Ratio: 0.03
// set_copy_if   : Time: 46.9797 ms +- 2.43009 ms
// set_remove_if : Time: 34.0161 ms +- 0.839472 ms
// sort          : Time: 59.5666 ms +- 1.34078 ms
// Ratio: 0.04
// set_copy_if   : Time: 44.3423 ms +- 2.271 ms
// set_remove_if : Time: 32.2404 ms +- 1.02162 ms
// sort          : Time: 57.0583 ms +- 2.9226 ms
// Ratio: 0.05
// set_copy_if   : Time: 41.758 ms +- 2.57589 ms
// set_remove_if : Time: 29.9927 ms +- 0.935529 ms
// sort          : Time: 54.1474 ms +- 1.63311 ms
// Ratio: 0.06
// set_copy_if   : Time: 40.289 ms +- 1.85715 ms
// set_remove_if : Time: 29.2604 ms +- 0.593869 ms
// sort          : Time: 57.5436 ms +- 5.52807 ms
// Ratio: 0.07
// set_copy_if   : Time: 40.5035 ms +- 1.80952 ms
// set_remove_if : Time: 29.1187 ms +- 0.63127 ms
// sort          : Time: 53.622 ms +- 1.91357 ms
// Ratio: 0.08
// set_copy_if   : Time: 38.8139 ms +- 1.9811 ms
// set_remove_if : Time: 27.9989 ms +- 0.600543 ms
// sort          : Time: 50.5743 ms +- 1.35296 ms
// Ratio: 0.09
// set_copy_if   : Time: 39.0751 ms +- 1.71393 ms
// set_remove_if : Time: 28.2332 ms +- 0.607895 ms
// sort          : Time: 51.2829 ms +- 1.21077 ms
// Ratio: 0.1
// set_copy_if   : Time: 35.6847 ms +- 1.81495 ms
// set_remove_if : Time: 25.204 ms +- 0.538245 ms
// sort          : Time: 46.4127 ms +- 2.66714 ms
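A harness along the following lines could be used to collect timings of this kind. It is only a sketch and not the benchmark code used for the numbers above: `makeInput` and `timeMs` are made-up names, and the way the duplicate ratio is turned into data (a fixed number of distinct values, shuffled) is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical generator: n elements, of which about dupRatio * n repeat
// a value that also appears elsewhere in the vector.
inline std::vector<int> makeInput(std::size_t n, double dupRatio, unsigned seed) {
  std::vector<int> v;
  if (n == 0) return v;
  std::size_t distinct =
      static_cast<std::size_t>(std::llround(n * (1.0 - dupRatio)));
  distinct = std::max<std::size_t>(1, std::min(distinct, n));
  std::mt19937 gen(seed);
  std::uniform_int_distribution<std::size_t> pick(0, distinct - 1);
  v.reserve(n);
  // One copy of every distinct value, then random repeats, then shuffle.
  for (std::size_t i = 0; i < distinct; ++i) v.push_back(static_cast<int>(i));
  for (std::size_t i = distinct; i < n; ++i) v.push_back(static_cast<int>(pick(gen)));
  std::shuffle(v.begin(), v.end(), gen);
  return v;
}

// Hypothetical timer: runs f(input, output) once and returns milliseconds.
// output starts as a copy of input so in-place solutions can work on it.
template <typename F>
double timeMs(F f, const std::vector<int>& input) {
  std::vector<int> output(input);
  const auto start = std::chrono::steady_clock::now();
  f(input, output);
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Any of the three functions listed above fits the `f(input, output)` shape, for example `timeMs(uniquifyWithOrder_sort, makeInput(1000000, 0.1, 1))`. Averaging over many runs, as the results above do, is left out for brevity.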

Here is what WilliamKF is searching for. It uses the erase statement. This code is good for lists but isn't good for vectors. For vectors you should not use the erase statement.

//makes uniques in one shot without sorting !! 
template<class listtype> inline
void uniques(listtype* In)
    {

    // 'typename' is required because iterator is a dependent type
    typename listtype::iterator it = In->begin();
    typename listtype::iterator it2= In->begin();

    int tmpsize = In->size();

        while(it!=In->end())
        {
        it2 = it;
        it2++;
        while((it2)!=In->end())
            {
            if ((*it)==(*it2))
                In->erase(it2++);
            else
                ++it2;
            }
        it++;

        }
    }

What I have tried for vectors without using sort is this:

//makes vectors as fast as possible unique
template<typename T> inline
void vectoruniques(std::vector<T>* In)
    {

    if (In->empty()) return; // In->end()-1 below is invalid for an empty vector

        for (typename std::vector<T>::iterator it = In->begin();it<In->end()-1;it++)
        {
            T tmp = *it;
            for (typename std::vector<T>::iterator it2 = it+1;it2<In->end();it2++)
            {
                if (*it2!=*it)
                    tmp = *it2;
                else
                    *it2 = tmp;
            }
        }
        typename std::vector<T>::iterator it = std::unique(In->begin(),In->end());
        int newsize = std::distance(In->begin(),it);
            In->resize(newsize);
    }

Somehow it looks like this would work. I tested it a bit; maybe somebody can tell whether this really works! This solution doesn't need any greater-than operator. I mean, why use the greater-than operator for searching unique elements? Usage for vectors:

int myints[] = {21,10,20,20,20,30,21,31,20,20,2}; 
std::vector<int> abc(myints , myints+11);
vectoruniques(&abc);
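One way to answer the "does this really work" question is to compare a candidate against a deliberately simple reference on many small random inputs. The sketch below is an illustration only: `referenceUnique` and `agreesWithReference` are made-up names, and the candidate is assumed to take a pointer to the vector, as vectoruniques does. Passing such a check is evidence, not proof.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Straightforward O(n^2) reference: keep a value only if it has not been
// kept already, so first occurrences stay in their original order.
inline std::vector<int> referenceUnique(const std::vector<int>& in) {
  std::vector<int> out;
  for (int x : in)
    if (std::find(out.begin(), out.end(), x) == out.end())
      out.push_back(x);
  return out;
}

// Hypothetical checker: runs candidate(&v) on random small vectors and
// reports whether every result matched the reference.
template <typename F>
bool agreesWithReference(F candidate, int trials, unsigned seed) {
  std::mt19937 gen(seed);
  // Few distinct values, so duplicates are common; lengths start at 1.
  std::uniform_int_distribution<int> len(1, 12), val(0, 4);
  for (int t = 0; t < trials; ++t) {
    std::vector<int> v(len(gen));
    for (int& x : v) x = val(gen);
    const std::vector<int> expected = referenceUnique(v);
    candidate(&v);
    if (v != expected) return false;
  }
  return true;
}
```

With vectoruniques in scope, a call such as `agreesWithReference([](std::vector<int>* p) { vectoruniques(p); }, 10000, 1)` would exercise it.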

Here's something that handles POD and non-POD types with move support. It uses the default operator== or a custom equality predicate. It does not require sorting/operator<, key generation, or a separate set. No idea if this is more efficient than the other methods described above.

template <typename Cnt, typename _Pr = std::equal_to<typename Cnt::value_type>>
void remove_duplicates( Cnt& cnt, _Pr cmp = _Pr() )
{
    Cnt result;
    result.reserve( std::size( cnt ) );  // or cnt.size() if compiler doesn't support std::size()

    std::copy_if( 
        std::make_move_iterator( std::begin( cnt ) )
        , std::make_move_iterator( std::end( cnt ) )
        , std::back_inserter( result )
        , [&]( const typename Cnt::value_type& what ) 
        { 
            return std::find_if( 
                std::begin( result )
                , std::end( result )
                , [&]( const typename Cnt::value_type& existing ) { return cmp( what, existing ); }
            ) == std::end( result );
        }
    );  // copy_if

    cnt = std::move( result );  // place result in cnt param
}   // remove_duplicates

Usage/tests:

{
    std::vector<int> ints{ 0,1,1,2,3,4 };
    remove_duplicates( ints );
    assert( ints.size() == 5 );
}

{
    struct data 
    { 
        std::string foo; 
        bool operator==( const data& rhs ) const { return this->foo == rhs.foo; }
    };

    std::vector<data>
        mydata{ { "hello" }, {"hello"}, {"world"} }
        , mydata2 = mydata
        ;

    // use operator==
    remove_duplicates( mydata );
    assert( mydata.size() == 2 );

    // use custom predicate
    remove_duplicates( mydata2, []( const data& left, const data& right ) { return left.foo == right.foo; } );
    assert( mydata2.size() == 2 );

}

As your vector contains integers, a much faster solution than "sort and unique" is to declare beforehand a big global or static array initialized with -1 (t0 in my example) and use it. The results show it is about 50 times faster than "sort and unique":

results :
duration sort then unique == 36871
duration unique without sorting == 656
unique without sort is ok

Here is the code:

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>

int main(int argc, char*argv []) {
  //I always declare at least 2 big global vector<int> in my software, to speed up operations like unique,
  //so I am not counting the creation of these "huge" vectors in the algorithm's duration.
  const int N (5000); //max possible integer of vector tested
  const auto zero ((const int) 0), sm1 ((const int) -1);
  std::vector<int> t0 (N, sm1);

  //init vector to render unique
  std::vector<int> vec (1000000, sm1);
  std::for_each (vec.begin (), vec.end (), [] (auto& s) {s = rand ()%1000;}); //1000 < N 
  std::vector<int> vec1 (vec);

  clock_t beg (clock ()), end (beg);
  {
    beg = clock ();
    std::sort (vec.begin (), vec.end ());
    auto j (std::unique (vec.begin (), vec.end ()));
    if (j != vec.end ()) vec.erase (j, vec.end ());
    end = clock ();
    std::cout << "\tduration sort then unique == " << (end - beg) << std::endl;
  }

  {
    beg = clock ();
    auto j (vec1.begin ());
    std::for_each (vec1.begin (), vec1.end (), [&j, &t0, &zero] (const auto& s) {
      if (t0 [s]) {
        *j++ = s;
        t0 [s] = zero;
      }
    });
    if (j != vec1.end ()) vec1.erase (j, vec1.end ());
    //t0 clean t0
    std::for_each (vec1.begin (), vec1.end (), [&t0, &sm1] (const auto&s) {t0 [s] = sm1;});        
    end = clock ();
    //sorting is just here to compare with sort+unique, but not counting in duration
    std::sort (vec1.begin (), vec1.end ());

    std::cout << "\tduration unique without sorting == " << (end - beg) << std::endl;
  }
  if (vec == vec1) std::cout << "\tunique without sort is ok" << std::endl;
  else std::cout << "\tunique without sort is nok" << std::endl;
}
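Stripped of the timing code, the lookup-table idea can be sketched as a small function. This is an illustration under stated assumptions: every value must lie in [0, maxValue), and the name `uniqueBounded` is made up. Unlike the code above, it allocates the table on each call instead of reusing a long-lived one, so it will not reproduce the quoted timings.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: in-place, order-preserving removal of duplicates
// for values known to lie in [0, maxValue). Other values are not supported.
inline void uniqueBounded(std::vector<int>& vec, std::size_t maxValue) {
  std::vector<char> seen(maxValue, 0);  // seen[x] != 0 once x was kept
  std::size_t write = 0;
  for (std::size_t read = 0; read < vec.size(); ++read) {
    const int value = vec[read];
    assert(value >= 0 && static_cast<std::size_t>(value) < maxValue);
    if (!seen[value]) {
      seen[value] = 1;
      vec[write++] = value;  // compact first occurrences to the front
    }
  }
  vec.resize(write);
}
```

The table costs one byte per possible value, so this only pays off when the value range is small compared with available memory.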

Here is a C++11 generic version that works with iterators and doesn't allocate additional storage. It has the disadvantage of being O(n^2) in the worst case, but is likely faster for smaller input sizes.

template<typename Iter>
Iter removeDuplicates(Iter begin,Iter end)
{
    auto it = begin;
    while(it != end)
    {
        auto next = std::next(it);
        if(next == end)
        {
            break;
        }
        end = std::remove(next,end,*it);
        it = next;
    }

    return end;
}

....

vec.erase(removeDuplicates(vec.begin(),vec.end()),vec.end());

Sample Code: http://cpp.sh/5kg5n

