
C++ How to merge sorted vectors into a sorted vector / pop the least element from all of them?

I have a collection of about a hundred or so sorted vector<int>'s. Although most of the vectors have only a small number of integers in them, some contain a large number (>10K) of them (thus the vectors don't necessarily have the same size).

What I'd like to do is essentially iterate from the smallest to the largest integer contained across all of these sorted vectors.

One way to do it would be to merge all these sorted vectors into one sorted vector and simply iterate over it. Thus:

Question 1: What is the fastest way to merge sorted vectors into a sorted vector?

On the other hand, I'm sure there are faster / cleverer ways to accomplish this without merging and re-sorting the whole thing, perhaps by iteratively popping the smallest integer from this collection of sorted vectors, without merging them first. So:

Question 2: What is the fastest / best way to pop the least element from a bunch of sorted vector<int>'s?


Based on the replies below and the comments on the question, I've implemented an approach where I build a priority queue of iterators into the sorted vectors. I'm not sure whether this is performance-efficient, but it seems to be very memory-efficient. I consider the question still open, since I'm not sure we've established the fastest way yet.

#include <iostream>
#include <queue>
#include <utility>
#include <vector>
using namespace std;

// compare (current, end) iterator pairs by the integer currently pointed at,
// so the pair whose current element is smallest sits on top of the priority queue
struct cmp_seeds {
    bool operator () (const pair< vector<int>::iterator, vector<int>::iterator >& p1,
                      const pair< vector<int>::iterator, vector<int>::iterator >& p2) const {
        return *(p1.first) >  *(p2.first);
    }
};

int pq_heapsort_trial() {

    /* Set up the Sorted Vectors */ 
    int a1[] = { 2, 10, 100};
    int a2[] = { 5, 15, 90, 200};
    int a3[] = { 12 };

    vector<int> v1 (a1, a1 + sizeof(a1) / sizeof(int));
    vector<int> v2 (a2, a2 + sizeof(a2) / sizeof(int));
    vector<int> v3 (a3, a3 + sizeof(a3) / sizeof(int));

    vector< vector <int> * > sorted_vectors;
    sorted_vectors.push_back(&v1);
    sorted_vectors.push_back(&v2);
    sorted_vectors.push_back(&v3);
    /* the above simulates the "for" loop I have in my own code that gives me sorted vectors */

    pair< vector<int>::iterator, vector<int>::iterator> c_lead;
    cmp_seeds mycompare;

    priority_queue< pair< vector<int>::iterator, vector<int>::iterator>, vector<pair< vector<int>::iterator, vector<int>::iterator> >, cmp_seeds> cluster_feeder(mycompare);


    for (vector<vector <int> *>::iterator k = sorted_vectors.begin(); k != sorted_vectors.end(); ++k) {
        cluster_feeder.push( make_pair( (*k)->begin(), (*k)->end() ));
    }


    while ( cluster_feeder.empty() != true) {
        c_lead = cluster_feeder.top();
        cluster_feeder.pop();
        // sorted output
        cout << *(c_lead.first) << endl;

        c_lead.first++;
        if (c_lead.first != c_lead.second) {
            cluster_feeder.push(c_lead);
        }
    }

    return 0;
}

One option is to use a std::priority_queue to maintain a heap of iterators, where the iterators bubble up the heap depending on the values they point at.

You could also consider using repeated applications of std::inplace_merge. This would involve appending all the data together into a big vector and remembering the offsets at which each distinct sorted block begins and ends, and then passing those into inplace_merge. This would probably be faster than the heap solution, although I think fundamentally the complexity is equivalent.

Update: I've implemented the second algorithm I just described, repeatedly doing a mergesort in place. This code is on ideone.

This works by first concatenating all the sorted lists together into one long list. If there were three source lists, this means there are four 'offsets', which are four points in the full list between which the elements are sorted. The algorithm then pulls off three of these at a time, merging the two corresponding adjacent sorted lists into one sorted list, and then remembering two of those three offsets to be used in the new_offsets.

This repeats in a loop, with pairs of adjacent sorted ranges merged together, until only one sorted range remains.

Ultimately, I think the best algorithm would involve merging the shortest pairs of adjacent ranges together first.

// http://stackoverflow.com/questions/9013485/c-how-to-merge-sorted-vectors-into-a-sorted-vector-pop-the-least-element-fro/9048857#9048857
#include <iostream>
#include <vector>
#include <algorithm>
#include <cassert>
using namespace std;

template<typename T, size_t N>
vector<T> array_to_vector( T(*array)[N] ) { // Yes, this works. By passing in the *address* of
                                            // the array, all the type information, including the
                                            // length of the array, is known at compile time.
        vector<T> v( *array, &((*array)[N]));
        return v;
}   

void merge_sort_many_vectors() {

    /* Set up the Sorted Vectors */ 
    int a1[] = { 2, 10, 100};
    int a2[] = { 5, 15, 90, 200};
    int a3[] = { 12 };

    vector<int> v1  = array_to_vector(&a1);
    vector<int> v2  = array_to_vector(&a2);
    vector<int> v3  = array_to_vector(&a3);


    vector<int> full_vector;
    vector<size_t> offsets;
    offsets.push_back(0);

    full_vector.insert(full_vector.end(), v1.begin(), v1.end());
    offsets.push_back(full_vector.size());
    full_vector.insert(full_vector.end(), v2.begin(), v2.end());
    offsets.push_back(full_vector.size());
    full_vector.insert(full_vector.end(), v3.begin(), v3.end());
    offsets.push_back(full_vector.size());

    assert(full_vector.size() == v1.size() + v2.size() + v3.size());

    cout << "before:\t";
    for(vector<int>::const_iterator v = full_vector.begin(); v != full_vector.end(); ++v) {
            cout << ", " << *v;
    }       
    cout << endl;
    while(offsets.size()>2) {
            assert(offsets.back() == full_vector.size());
            assert(offsets.front() == 0);
            vector<size_t> new_offsets;
            size_t x = 0;
            while(x+2 < offsets.size()) {
                    // mergesort (offsets[x],offsets[x+1]) and (offsets[x+1],offsets[x+2])
                    inplace_merge(&full_vector.at(offsets.at(x))
                                 ,&full_vector.at(offsets.at(x+1))
                                 ,&(full_vector[offsets.at(x+2)]) // this *might* be at the end
                                 );
                    // now they are sorted, we just put offsets[x] and offsets[x+2] into the new offsets.
                    // offsets[x+1] is not relevant any more
                    new_offsets.push_back(offsets.at(x));
                    new_offsets.push_back(offsets.at(x+2));
                    x += 2;
            }
            // if the number of offsets was odd, there might be a dangling offset
            // which we must remember to include in the new_offsets
            if(x+2==offsets.size()) {
                    new_offsets.push_back(offsets.at(x+1));
            }
            // assert(new_offsets.front() == 0);
            assert(new_offsets.back() == full_vector.size());
            offsets.swap(new_offsets);

    }
    cout << "after: \t";
    for(vector<int>::const_iterator v = full_vector.begin(); v != full_vector.end(); ++v) {
            cout << ", " << *v;
    }
    cout << endl;
}

int main() {
        merge_sort_many_vectors();
}

The first thing that springs to mind is to make a heap structure containing iterators to each vector, ordered by the value they currently point at. (Each entry would need to contain the end iterator too, of course.)

The current element is at the root of the heap, and to advance, you simply either pop it or increase its key. (The latter can be done by popping, incrementing, then pushing.)

I believe this should have asymptotic complexity O(E log M), where E is the total number of elements and M is the number of vectors: each of the E elements is pushed onto and popped from a heap that never holds more than M entries, and each heap operation costs O(log M).

If you are really popping everything out of the vectors, you could make a heap of pointers to your vectors; you may want to treat them as heaps too, to avoid the performance penalty of erasing from the front of a vector. (Or you could copy everything into deques first.)
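
For illustration, here is a minimal sketch of that destructive variant (not code from the answer): the data is assumed to have been copied into std::deque<int>'s so that popping from the front is cheap, and a heap of deque pointers, ordered by their front element, always exposes the overall least remaining value.

#include <deque>
#include <iostream>
#include <queue>
#include <vector>

// Order deque pointers so the one with the smallest front element is on top.
struct front_greater {
    bool operator()(const std::deque<int>* a, const std::deque<int>* b) const {
        return a->front() > b->front();
    }
};

int main() {
    std::deque<int> d1, d2;
    d1.push_back(2);  d1.push_back(10);  d1.push_back(100);
    d2.push_back(5);  d2.push_back(15);  d2.push_back(90);

    std::priority_queue<std::deque<int>*, std::vector<std::deque<int>*>, front_greater> pq;
    pq.push(&d1);
    pq.push(&d2);

    while (!pq.empty()) {
        std::deque<int>* d = pq.top();
        pq.pop();
        std::cout << d->front() << std::endl;  // least remaining element overall
        d->pop_front();                        // destructively consume it
        if (!d->empty()) pq.push(d);           // re-queue the deque if it still has elements
    }
    return 0;
}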


Merging them all together by merging pairs at a time has the same asymptotic complexity if you're careful about the order. If you arrange all of the vectors in a full, balanced binary tree and then merge pairwise as you go up the tree, each element will be copied log M times, also leading to an O(E log M) algorithm.

For extra actual efficiency, instead of the tree, you should repeatedly merge the smallest two vectors until you only have one left. (Again, putting pointers to the vectors in a heap is the way to go, but this time ordered by length.)

(Really, you want to order by "cost to copy" instead of length; an extra thing to optimize for certain value types.)
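
As a rough sketch of that strategy (illustrative only; merge_smallest_first is a hypothetical helper, not code from the answer), one can keep the vectors in a priority queue keyed by length and repeatedly std::merge the two shortest into a new vector until a single sorted vector remains. A real implementation would store indices or pointers rather than whole vectors to avoid the copies in and out of the queue.

#include <algorithm>
#include <iterator>
#include <queue>
#include <vector>

// Min-heap comparator: the shortest vector is popped first.
struct longer_first {
    bool operator()(const std::vector<int>& a, const std::vector<int>& b) const {
        return a.size() > b.size();
    }
};

// Repeatedly merge the two shortest sorted vectors until only one remains.
std::vector<int> merge_smallest_first(const std::vector< std::vector<int> >& inputs) {
    std::priority_queue< std::vector<int>, std::vector< std::vector<int> >, longer_first >
        pq(longer_first(), inputs);
    if (pq.empty()) return std::vector<int>();

    while (pq.size() > 1) {
        std::vector<int> a = pq.top(); pq.pop();
        std::vector<int> b = pq.top(); pq.pop();
        std::vector<int> merged;
        merged.reserve(a.size() + b.size());
        std::merge(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(merged));
        pq.push(merged);
    }
    return pq.top();
}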


If I had to guess, the fastest way would be to use the second idea, but with an N-ary merge instead of a pairwise merge, for some suitable N (which I'm guessing will be either a small constant or roughly the square root of the number of vectors), and to perform the N-ary merge by using the first algorithm above to enumerate the contents of N vectors at once.

I've taken the algorithm given here and done a little abstracting, converting it to templates. I've coded this version in VS2010 and used a lambda function instead of the functor. I don't know if this is in any sense 'better' than the previous version, but maybe it will be useful to someone?

#include <queue>
#include <utility>
#include <vector>

namespace priority_queue_sort
{
    using std::priority_queue;
    using std::pair;
    using std::make_pair;
    using std::vector;

    template<typename T>
    void value_vectors(const vector< vector <T> * >& input_sorted_vectors, vector<T> &output_vector)
    {
        typedef typename vector<T>::iterator iter;
        typedef pair<iter, iter>    iter_pair;

        static auto greater_than_lambda = [](const iter_pair& p1, const iter_pair& p2) -> bool { return *(p1.first) >  *(p2.first); };

        priority_queue<iter_pair, std::vector<iter_pair>, decltype(greater_than_lambda) > cluster_feeder(greater_than_lambda);

        size_t total_size(0);

        for (auto k = input_sorted_vectors.begin(); k != input_sorted_vectors.end(); ++k)
        {
            cluster_feeder.push( make_pair( (*k)->begin(), (*k)->end() ) );
            total_size += (*k)->size();
        }

        output_vector.resize(total_size);
        total_size = 0;
        iter_pair c_lead;
        while (cluster_feeder.empty() != true)
        {
            c_lead = cluster_feeder.top();
            cluster_feeder.pop();
            output_vector[total_size++] = *(c_lead.first);
            c_lead.first++;
            if (c_lead.first != c_lead.second) cluster_feeder.push(c_lead);
        }
    }

    template<typename U, typename V>
    void pair_vectors(const vector< vector < pair<U, V> > * >& input_sorted_vectors, vector< pair<U, V> > &output_vector)
    {
        typedef typename vector< pair<U, V> >::iterator iter;
        typedef pair<iter, iter> iter_pair;

        static auto greater_than_lambda = [](const iter_pair& p1, const iter_pair& p2) -> bool { return *(p1.first) >  *(p2.first); };

        priority_queue<iter_pair, std::vector<iter_pair>, decltype(greater_than_lambda) > cluster_feeder(greater_than_lambda);

        size_t total_size(0);

        for (auto k = input_sorted_vectors.begin(); k != input_sorted_vectors.end(); ++k)
        {
            cluster_feeder.push( make_pair( (*k)->begin(), (*k)->end() ) );
            total_size += (*k)->size();
        }

        output_vector.resize(total_size);
        total_size = 0;
        iter_pair c_lead;

        while (cluster_feeder.empty() != true)
        {
            c_lead = cluster_feeder.top();
            cluster_feeder.pop();
            output_vector[total_size++] = *(c_lead.first);  
            c_lead.first++;
            if (c_lead.first != c_lead.second) cluster_feeder.push(c_lead);
        }
    }
}

The algorithm priority_queue_sort::value_vectors sorts vectors containing values only, whereas priority_queue_sort::pair_vectors sorts vectors containing pairs of data according to the first data-element. Hope someone can use this someday :-)
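
For completeness, a small usage sketch of the templates above (the variable names are just illustrative, and the expected output assumes the three example vectors used earlier in this question):

#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v1, v2, v3;
    v1.push_back(2);   v1.push_back(10);  v1.push_back(100);
    v2.push_back(5);   v2.push_back(15);  v2.push_back(90);   v2.push_back(200);
    v3.push_back(12);

    std::vector< std::vector<int>* > inputs;
    inputs.push_back(&v1);
    inputs.push_back(&v2);
    inputs.push_back(&v3);

    std::vector<int> merged;
    priority_queue_sort::value_vectors(inputs, merged);

    // Expected output: 2 5 10 12 15 90 100 200
    for (std::size_t i = 0; i < merged.size(); ++i)
        std::cout << merged[i] << ' ';
    std::cout << std::endl;
    return 0;
}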
