How do I efficiently copy unique objects from one vector to another (which is made up of a subset of identical objects)?

Question

How can I efficiently copy objects (or a range of objects) from vector A into vector B,

where vector B already contains certain objects identical to those from vector A,

so that no objects copied from vector A are already listed in vector B?

I have a graph stored as a vector of edges in std::vector<MinTreeEdge> minTreeInput.

I have a minimum spanning tree created from this graph, stored in std::vector<MinTreeEdge> minTreeOutput.

I'm trying to randomly add a certain number of edges back into minTreeOutput. To do this, I want to copy elements from minTreeInput back into minTreeOutput until the latter contains the required number of edges. Of course, each edge object that is copied over must not already be stored in minTreeOutput; the graph can't have duplicate edges.

Below is what I've come up with so far. It works, but it's really long and I know the loop will have to be run many times depending on the graph and tree. I'd like to know how to do this properly:

    // Edge class
    struct MinTreeEdge
    {
        // For std::unique() between objects
        bool operator==(MinTreeEdge const &rhs) const noexcept
        {
            return lhs == rhs.lhs;
        }
        int lhs;

        int node1ID;
        int node2ID;
        int weight;
        ......
    };

             ......

    // The usage
    int currentSize = minTreeOutput.size();
    int targetSize = currentSize + numberOfEdgesToReturn;
    int sizeDistance = targetSize - currentSize;
    while(sizeDistance != 0)
    {
        //Probably really inefficient

        for(std::vector<MinTreeEdge>::iterator it = minTreeInput.begin(); it != minTreeInput.begin()+sizeDistance; ++it)
            minTreeOutput.push_back(*it);

        std::vector<MinTreeEdge>::iterator mto_it;
        mto_it = std::unique (minTreeOutput.begin(), minTreeOutput.end());

        currentSize = minTreeOutput.size();
        sizeDistance = targetSize - currentSize;
    }

Alternatively, is there a way to just list all the edges in minTreeInput (graph) that are not in minTreeOutput (tree) without having to check each individual element in the former against the latter?

Answer 1

How can I efficiently copy objects (or a range of objects) from vector A into vector B, where vector B already contains certain objects identical to those from vector A, so that no objects copied from vector A are already listed in vector B?

If I understand the question correctly, this can be paraphrased to "how can I create a set union of two vectors?".

Answer: with std::set_union

set_union where MinTreeEdge is cheap to copy

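Note that std::set_union requires both input ranges to be sorted by the same comparison that is passed to it. That requirement is what makes it efficient: it merges the two ranges in a single linear pass instead of checking every element of one vector against the other, which is the cost you have already touched upon.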

#include <vector>
#include <algorithm>
#include <cassert>
#include <iterator>
#include <utility>

struct MinTreeEdge
    {
        // For std::unique() between objects
        bool operator==(MinTreeEdge const &rhs) const noexcept
        {
            return lhs == rhs.lhs;
        }
        int lhs;

        int node1ID;
        int node2ID;
        int weight;
    };

struct lower_lhs
{
  bool operator()(const MinTreeEdge& l, const MinTreeEdge& r) const noexcept
  {
    return l.lhs < r.lhs;
  }
};

std::vector<MinTreeEdge> merge(std::vector<MinTreeEdge> a, 
                               std::vector<MinTreeEdge> b)
{
  // let's pessimistically assume that the inputs are not sorted
  // we could simply assert that they are if the caller is aware of
  // the requirement

  std::sort(a.begin(), a.end(), lower_lhs());
  std::sort(b.begin(), b.end(), lower_lhs());

  // alternatively...
  // assert(std::is_sorted(a.begin(), a.end(), lower_lhs()));
  // assert(std::is_sorted(b.begin(), b.end(), lower_lhs()));

  // optional step if the inputs are not already `unique`
  a.erase(std::unique(a.begin(), a.end()), a.end());
  b.erase(std::unique(b.begin(), b.end()), b.end());

  std::vector<MinTreeEdge> result;
  result.reserve(a.size() + b.size());

  std::set_union(a.begin(), a.end(),
                        b.begin(), b.end(),
                        std::back_inserter(result), 
                        lower_lhs());

  return result;
}

int main()
{
  // example use case

  auto a = std::vector<MinTreeEdge>{};
  auto b = std::vector<MinTreeEdge>{};

  b = merge(std::move(a), std::move(b));
}
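The example above calls merge on two empty vectors, so here is a minimal sketch with concrete data. The names Edge, by_id and union_by_id are hypothetical stand-ins (Edge::id plays the role of MinTreeEdge::lhs), and the inputs are assumed to be sorted and free of duplicates already:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical stand-in for MinTreeEdge: `id` plays the role of `lhs`.
struct Edge
{
    int id;
    int weight;
};

// Strict weak ordering on the identifying field only.
struct by_id
{
    bool operator()(const Edge& l, const Edge& r) const noexcept
    {
        return l.id < r.id;
    }
};

// Returns the union of two id-sorted, duplicate-free vectors.
// For an id present in both, the element from `first` is the one kept.
std::vector<Edge> union_by_id(const std::vector<Edge>& first,
                              const std::vector<Edge>& second)
{
    assert(std::is_sorted(first.begin(), first.end(), by_id()));
    assert(std::is_sorted(second.begin(), second.end(), by_id()));

    std::vector<Edge> result;
    result.reserve(first.size() + second.size());

    std::set_union(first.begin(), first.end(),
                   second.begin(), second.end(),
                   std::back_inserter(result),
                   by_id());
    return result;
}
```

Passing the tree as the first argument and the graph as the second means that, where an id appears in both vectors, the tree's copy is the one that ends up in the result.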

set_union where MinTreeEdge is expensive to copy

There has been some mention of sets to accomplish this. It is fair to say that if MinTreeEdge is expensive to copy and there are a great many of them, then we could expect to see a performance benefit in using an unordered_set. However, if the objects are expensive to copy then we would probably want to store them in our temporary set by reference.

I might do it this way:

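#include <cstddef>
#include <functional>
#include <unordered_set>
#include <utility>
#include <vector>

// utility class which converts unary and binary operations on
// a reference_wrapper into unary and binary operations on the 
// referred-to objects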
template<class unary, class binary>
struct reference_as_object
{
    template<class U>
    decltype(auto) operator()(const std::reference_wrapper<U>& l) const {
        return _unary(l.get());
    }

    template<class U, class V>
    decltype(auto) operator()(const std::reference_wrapper<U>& l,
                              const std::reference_wrapper<V>& r) const {
        return _binary(l.get(), r.get());
    }

    unary _unary;
    binary _binary;
};

// utility to help prevent typos when defining a set of references
template<class K, class H, class C> using unordered_reference_set =
std::unordered_set<
std::reference_wrapper<K>,
reference_as_object<H, C>,
reference_as_object<H, C>
>;

// define unary and binary operations for our set. This way we can
// avoid polluting MinTreeEdge with artificial relational operators

struct mte_hash
{
    std::size_t operator()(const MinTreeEdge& mte) const
    {
        return std::hash<int>()(mte.lhs);
    }
};

struct mte_equal
{
    bool operator()(MinTreeEdge const& l, MinTreeEdge const& r) const
    {
        return l.lhs == r.lhs;
    }
};

// merge function. arguments by value since we will be moving
// *expensive to copy* objects out of them, and the vectors themselves
// can be *moved* into our function very cheaply

std::vector<MinTreeEdge> merge2(std::vector<MinTreeEdge> a,
                                std::vector<MinTreeEdge> b)
{
    using temp_map_type = unordered_reference_set<MinTreeEdge, mte_hash, mte_equal>;

    // build a set of references to existing objects in b
    temp_map_type tmap;
    tmap.reserve(a.size() + b.size());

    // b first, since the requirements mentioned 'already in B'
    for (auto& ob : b) { tmap.insert(ob); }

    // now add missing references in a
    for (auto& oa : a) { tmap.insert(oa); }

    // now build the result, moving objects from a and b as required
    std::vector<MinTreeEdge> result;
    result.reserve(tmap.size());

    for (auto r : tmap) {
        result.push_back(std::move(r.get()));
    }

    return result;

    // a and b now hold elements which are valid but in an unspecified state.
    // The elements which were not moved from are the duplicates we don't need.
    // In summary, they are of no use to us, so we drop them.
}

Trimmings - MinTreeEdge is expensive to copy but very cheap to move

Let's say that we wanted to stick with the vector method (we almost always should), but that MinTreeEdge was a little expensive to copy. Say it uses a pimpl idiom for internal polymorphism which will inevitably mean a memory allocation on copy. But let's say that it's cheaply moveable. Let's also imagine that the caller cannot be expected to sort or uniqueify data before sending it to us.

We can still achieve good efficiency with standard algorithms and vectors:

std::vector<MinTreeEdge> merge(std::vector<MinTreeEdge> a,
                               std::vector<MinTreeEdge> b)
{
    // sorts a range if not already sorted
    // @return a reference to the range
    auto maybe_sort = [] (auto& c) -> decltype(auto)
    {
        auto begin = std::begin(c);
        auto end = std::end(c);
        if (not std::is_sorted(begin, end, lower_lhs()))
            std::sort(begin, end, lower_lhs());
        return c;
    };

    // uniqueify a range, returning the new 'end' of
    // valid data
    // @pre c is sorted
    // @return result of std::unique(...)
    auto unique = [](auto& c) -> decltype(auto)
    {
        auto begin = std::begin(c);
        auto end = std::end(c);
        return std::unique(begin, end);
    };

    // turn an iterator into a move-iterator        
    auto mm = [](auto iter) { return std::make_move_iterator(iter); };


    std::vector<MinTreeEdge> result;
    result.reserve(a.size() + b.size());

    // create a set_union from two input containers.
    // @post a and b shall be in a valid but undefined state

    std::set_union(mm(a.begin()), mm(unique(maybe_sort(a))),
                   mm(b.begin()), mm(unique(maybe_sort(b))),
                   std::back_inserter(result),
                   lower_lhs());

    return result;
}

If one provides a free function void swap(MinTreeEdge& l, MinTreeEdge& r) noexcept, then this function will require exactly N moves into the result, where N is the size of the result set (any sorting that turns out to be necessary rearranges the inputs in place on top of that). Since in a pimpl class a move is simply a pointer swap, this algorithm remains efficient.
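One way to check the move count is with an instrumented type. The sketch below is not part of the answer's code: CountedEdge, by_id and move_union are hypothetical names, and the inputs are assumed to be sorted and unique already, so that only the moves into the result are counted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical instrumented edge type: counts element copies and moves.
struct CountedEdge
{
    static std::size_t copies;
    static std::size_t moves;

    int id = 0;

    explicit CountedEdge(int i) noexcept : id(i) {}
    CountedEdge(const CountedEdge& o) : id(o.id) { ++copies; }
    CountedEdge(CountedEdge&& o) noexcept : id(o.id) { ++moves; }
    CountedEdge& operator=(const CountedEdge& o) { id = o.id; ++copies; return *this; }
    CountedEdge& operator=(CountedEdge&& o) noexcept { id = o.id; ++moves; return *this; }
};

std::size_t CountedEdge::copies = 0;
std::size_t CountedEdge::moves = 0;

struct by_id
{
    bool operator()(const CountedEdge& l, const CountedEdge& r) const noexcept
    {
        return l.id < r.id;
    }
};

// Moves the union of two sorted, duplicate-free vectors into a new vector.
std::vector<CountedEdge> move_union(std::vector<CountedEdge> a,
                                    std::vector<CountedEdge> b)
{
    assert(std::is_sorted(a.begin(), a.end(), by_id()));
    assert(std::is_sorted(b.begin(), b.end(), by_id()));

    std::vector<CountedEdge> result;
    result.reserve(a.size() + b.size()); // no reallocation, so no extra moves

    std::set_union(std::make_move_iterator(a.begin()), std::make_move_iterator(a.end()),
                   std::make_move_iterator(b.begin()), std::make_move_iterator(b.end()),
                   std::back_inserter(result),
                   by_id());
    return result;
}
```

With inputs holding ids {1, 3, 5} and {3, 4}, the counters should show no copies and one move per element of the result on the mainstream standard library implementations. The reserve call matters here: without it, vector growth would add moves of its own.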

Answer 2

Since your output vector should not contain duplicates, one way to avoid storing them is to change the output container to a std::set<MinTreeEdge> instead of a std::vector<MinTreeEdge>. A std::set does not store duplicates, so you do not have to write the code to do this check yourself.

First, you need to define an operator < for your MinTreeEdge class:

 struct MinTreeEdge
 {
     // For std::unique() between objects
     bool operator==(MinTreeEdge const &rhs) const noexcept
     {
         return lhs == rhs.lhs;
     }
     // For ordering inside std::set
     bool operator<(MinTreeEdge const &rhs) const noexcept
     {
         return lhs < rhs.lhs;
     }
//...
};

Once you do that, the while loop can be replaced with the following:

#include <set>
#include <vector>
#include <iterator>
#include <algorithm>
//...
std::vector<MinTreeEdge> minTreeInput;
//...
std::set<MinTreeEdge> minTreeOutput;
//...
std::copy(minTreeInput.begin(), minTreeInput.end(), 
          std::inserter(minTreeOutput, minTreeOutput.begin()));

There is no need to call std::unique at all, since it is the std::set that will check for the duplicates.

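If the output container has to stay as a std::vector, you can still do the above using a temporary std::set and then copy the std::set back to the output vector. The set is seeded with the edges already in the output, so that they are not appended a second time:

std::vector<MinTreeEdge> minTreeInput;
std::vector<MinTreeEdge> minTreeOutput;
//... 
// start from the edges already in the output; of two equivalent
// edges, the one inserted first is the one the set keeps
std::set<MinTreeEdge> tempSet(minTreeOutput.begin(), minTreeOutput.end());
std::copy(minTreeInput.begin(), minTreeInput.end(), 
          std::inserter(tempSet, tempSet.begin())); 

// replace the output with the de-duplicated (and now sorted) contents
minTreeOutput.assign(tempSet.begin(), tempSet.end());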
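The question also asks to stop once a required number of edges is reached, which copying a whole set does not do. A minimal sketch of that variation follows, assuming a hypothetical Edge type identified by its id field and a hypothetical helper named append_missing. It relies on std::set::insert reporting, through the second member of its return value, whether the key was newly inserted:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in for MinTreeEdge: `id` plays the role of `lhs`.
struct Edge
{
    int id;
    int weight;
};

// Appends edges from `input` to `output` until `output` holds `targetSize`
// edges, skipping any edge whose id is already present in `output`.
// Returns the number of edges appended.
std::size_t append_missing(std::vector<Edge>& output,
                           const std::vector<Edge>& input,
                           std::size_t targetSize)
{
    assert(output.size() <= targetSize);

    // ids already stored in the output
    std::set<int> seen;
    for (const Edge& e : output) {
        seen.insert(e.id);
    }

    std::size_t added = 0;
    for (const Edge& e : input) {
        if (output.size() >= targetSize) {
            break;
        }
        // insert().second is true only when the id was not in the set yet
        if (seen.insert(e.id).second) {
            output.push_back(e);
            ++added;
        }
    }
    return added;
}
```

This keeps the output as a vector and preserves its existing order. The edges are taken in input order, so a random selection would need the input (or a copy of it) to be shuffled first, for example with std::shuffle.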

Answer 3

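You may use the following, which collects the edges of the graph that are not in the tree with std::set_difference, shuffles them, and keeps just enough of them to reach the expected size. Both inputs must already be sorted: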

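#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <random>
#include <vector>

struct MinTreeEdge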
{
    bool operator<(MinTreeEdge const &rhs) const noexcept
    {
        return id < rhs.id;
    }
    int id;

    int node1ID;
    int node2ID;
    int weight;
};

std::vector<MinTreeEdge> CreateRandomGraph(const std::vector<MinTreeEdge>& minSpanningTree,
                                           const std::vector<MinTreeEdge>& wholeTree,
                                           std::mt19937& rndEng,
                                           std::size_t expectedSize)
{
    assert(std::is_sorted(minSpanningTree.begin(), minSpanningTree.end())); 
    assert(std::is_sorted(wholeTree.begin(), wholeTree.end())); 
    assert(minSpanningTree.size() <= expectedSize);
    assert(expectedSize <= wholeTree.size());

    std::vector<MinTreeEdge> res;
    std::set_difference(wholeTree.begin(), wholeTree.end(),
                        minSpanningTree.begin(), minSpanningTree.end(),
                        std::back_inserter(res));

    std::shuffle(res.begin(), res.end(), rndEng);
    res.resize(expectedSize - minSpanningTree.size());
    res.insert(res.end(), minSpanningTree.begin(), minSpanningTree.end());
    // std::sort(res.begin(), res.end());
    return res;
}
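The std::set_difference step on its own also answers the closing part of the question, listing the graph edges that are not in the tree. Here is a minimal sketch of just that step, assuming hypothetical names Edge, by_id and edges_not_in_tree, with both vectors sorted by id:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical stand-in for MinTreeEdge, identified by `id`.
struct Edge
{
    int id;
    int weight;
};

// Strict weak ordering on the identifying field only.
struct by_id
{
    bool operator()(const Edge& l, const Edge& r) const noexcept
    {
        return l.id < r.id;
    }
};

// Returns the edges of `graph` whose id does not occur in `tree`.
// Both inputs must be sorted by id.
std::vector<Edge> edges_not_in_tree(const std::vector<Edge>& graph,
                                    const std::vector<Edge>& tree)
{
    assert(std::is_sorted(graph.begin(), graph.end(), by_id()));
    assert(std::is_sorted(tree.begin(), tree.end(), by_id()));

    std::vector<Edge> missing;
    std::set_difference(graph.begin(), graph.end(),
                        tree.begin(), tree.end(),
                        std::back_inserter(missing),
                        by_id());
    return missing;
}
```

Because both ranges are sorted, this is a single linear pass rather than a search of the tree for every edge of the graph.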

solution1
5 ACCEPTED 2016-08-21 19:37:07

solution2
1 2016-08-21 19:59:57

solution3
0 2016-08-22 08:17:15
