Efficiently process each unique permutation of a vector when number of unique elements in vector is much smaller than vector size

Question

In a program I need to apply a function in parallel to each unique permutation of a vector. The size of the vector is around N=15

I already have a function void parallel_for_each_permutation which I can use in combination with a std::set to only process each unique permutation exactly once.

This all works well for the general case. However, in my use case the number of unique elements k per vector is very limited, usually around k=4. This means that I'm currently wasting time constructing the same unique permutation over and over again, just to throw it away because it has already been processed.

Is it possible to process all unique permutations in this special case, without constructing all N! permutations?

Example use-case:

#include <algorithm>
#include <thread>
#include <vector>
#include <mutex>
#include <numeric>
#include <set>
#include <iostream>

template<class Container1, class Container2>
struct Comp{
    //compare element-wise less than
    bool operator()(const Container1& l, const Container2& r) const{
        auto pair = std::mismatch(l.begin(), l.end(), r.begin());
        if(pair.first == l.end() && pair.second == r.end())
            return false;
        return *(pair.first) < *(pair.second);
    }
};

template<class Container, class Func>
void parallel_for_each_permutation(const Container& container, int num_threads, Func func){
    auto ithPermutation = [](int n, size_t i) -> std::vector<size_t>{
        // https://stackoverflow.com/questions/7918806/finding-n-th-permutation-without-computing-others
        std::vector<size_t> fact(n);
        std::vector<size_t> perm(n);

        fact[0] = 1;
        for(int k = 1; k < n; k++)
            fact[k] = fact[k-1] * k;

        for(int k = 0; k < n; k++){
            perm[k] = i / fact[n-1-k];
            i = i % fact[n-1-k];
        }

        for(int k = n-1; k > 0; k--){
            for(int j = k-1; j >= 0; j--){
                if(perm[j] <= perm[k])
                    perm[k]++;
            }
        }

        return perm;
    };

    size_t totalNumPermutations = 1;
    for(size_t i = 1; i <= container.size(); i++)
        totalNumPermutations *= i;

    std::vector<std::thread> threads;

    for(int threadId = 0; threadId < num_threads; threadId++){
        threads.emplace_back([&, threadId](){
            const size_t firstPerm = size_t(float(threadId) * totalNumPermutations / num_threads);
            const size_t last_excl = std::min(totalNumPermutations, size_t(float(threadId+1) * totalNumPermutations / num_threads));

            Container permutation(container);

            auto permIndices = ithPermutation(container.size(), firstPerm);

            size_t count = firstPerm;
            do{
                for(int i = 0; i < int(permIndices.size()); i++){
                    permutation[i] = container[permIndices[i]];
                }

                func(threadId, permutation);
                std::next_permutation(permIndices.begin(), permIndices.end());
                ++count;
            }while(count < last_excl);
        });
    }

    for(auto& thread : threads)
        thread.join();
}

template<class Container, class Func>
void parallel_for_each_unique_permutation(const Container& container, Func func){
    using Comparator = Comp<Container, Container>;
    constexpr int numThreads = 4;

    std::set<Container, Comparator> uniqueProcessedPermutations(Comparator{});
    std::mutex m;

    parallel_for_each_permutation(
        container,
        numThreads,
        [&](int threadId, const auto& permutation){

            {
                std::lock_guard<std::mutex> lg(m);
                if(uniqueProcessedPermutations.count(permutation) > 0){
                    return;
                }else{
                    uniqueProcessedPermutations.insert(permutation);
                }
            }

            func(permutation);
        }
    );
}

int main(){
    std::vector<int> vector1{1,1,1,1,2,3,2,2,3,3,1};

    auto func = [](const auto& vec){return;};

    parallel_for_each_unique_permutation(vector1, func);
}

Answer 1

The permutations you have to work with are known in the field of combinatorics as multiset permutations .

They are described for example on The Combinatorial Object Server with more detailed explanations in this paper by professor Tadao Takaoka .

You have some related Python code and some C++ code in the FXT open source library .

You might consider adding the "multiset" and "combinatorics" tags to your question.

One possibility is to borrow the (header-only) algorithmic code from the FXT library, which provides a simple generator class for those multiset permutations .

Performance level:

Using the FXT algorithm on a test vector of 15 objects, {1,1,1, 2,2,2, 3,3,3,3, 4,4,4,4,4} , one can generate all associated 12,612,600 "permutations" in less than 2 seconds on a plain vanilla Intel x86-64 machine; this is without diagnostics text I/O and without any attempt at optimization.

The algorithm generates exactly those "permutations" that are required, nothing more. So there is no longer a need to generate all 15! "raw" permutations nor to use mutual exclusion to update a shared data structure for filtering purposes.

An adaptor class for generating the permutations:

I will try below to provide code for an adaptor class, which allows your application to use the FXT algorithm while containing the dependency into a single implementation file. That way, the code will hopefully fit better into your application. Think FXT's ulong type and use of raw pointers, versus std::vector<std::size_t> in your code. Besides, FXT is a very extensive library.

Header file for the "adaptor" class:

// File:  MSetPermGen.h

#ifndef  MSET_PERM_GEN_H
#define  MSET_PERM_GEN_H

#include  <iostream>
#include  <vector>

class MSetPermGenImpl;  // from algorithmic backend

using  IntVec  = std::vector<int>;
using  SizeVec = std::vector<std::size_t>;

// Generator class for multiset permutations:

class MSetPermGen {
public:
    MSetPermGen(const IntVec& vec);

    std::size_t       getCycleLength() const;
    bool              forward(size_t incr);
    bool              next();
    const SizeVec&    getPermIndices() const;
    const IntVec&     getItems() const;
    const IntVec&     getItemValues() const;

private: 
    std::size_t       cycleLength_;
    MSetPermGenImpl*  genImpl_;         // implementation generator
    IntVec            itemValues_;      // only once each
    IntVec            items_;           // copy of ctor argument
    SizeVec           freqs_;           // repetition counts
    SizeVec           state_;           // array of indices in 0..n-1
};

#endif

The class constructor takes exactly the argument type provided in your main program. Of course, the key method is next() . You can also move the automaton by several steps at once using the forward(incr) method.

Example client program:

// File:  test_main.cpp

#include  <cassert>
#include  "MSetPermGen.h"

using  std::cout;
using  std::cerr;
using  std::endl;

// utility functions:

std::vector<int>  getMSPermutation(const MSetPermGen& mspg)
{
    std::vector<int>  res;
    auto indices = mspg.getPermIndices();  // always between 0 and n-1
    auto values  = mspg.getItemValues();  // whatever the user put in

    std::size_t n = indices.size();
    assert( n == items.size() );
    res.reserve(n);

    for (std::size_t i=0; i < n; i++) {
        auto xi = indices[i];
        res.push_back(values[xi]);
    }

    return res;
}

void printPermutation(const std::vector<int>& p, std::ostream& fh)
{
    std::size_t n = p.size();

    for (size_t i=0; i < n; i++)
        fh << p[i] << " ";
    fh << '\n';
}

int main(int argc, const char* argv[])
{
    std::vector<int>  vec0{1,1, 2,2,2};                        // N=5
    std::vector<int>  vec1{1,1, 1,1, 2, 3, 2,2, 3,3, 1};       // N=11
    std::vector<int>  vec2{1,1,1, 2,2,2, 3,3,3,3, 4,4,4,4,4};  // N=15

    MSetPermGen  pg0{vec0};
    MSetPermGen  pg1{vec1};
    MSetPermGen  pg2{vec2};

    auto pg = &pg0;  // choice of 0, 1, 2 for sizing
    auto cl = pg->getCycleLength();

    auto permA = getMSPermutation(*pg);
    printPermutation(permA, cout);
    for (std::size_t pi=0; pi < (cl-1); pi++) {
        pg->next();
        auto permB = getMSPermutation(*pg);
        printPermutation(permB, cout);
    }

    return EXIT_SUCCESS;
}

Text output from the above small program:

You get only 10 items from vector {1,1, 2,2,2}, because 5! / (2! * 3!) = 120/(2*6) = 10.

The implementation file for the adaptor class, MSetPermGen.cpp , consists of two parts. The first part is FXT code with minimal adaptations. The second part is the MSetPermGen class proper.

First part of implementation file:

// File:  MSetPermGen.cpp - part 1 of 2 - FXT code

// -------------- Beginning  of header-only FXT combinatorics code -----------

 // This file is part of the FXT library.
 // Copyright (C) 2010, 2012, 2014 Joerg Arndt
 // License: GNU General Public License version 3 or later,
 // see the file COPYING.txt in the main directory.

//--  https://www.jjj.de/fxt/ 
//--  https://fossies.org/dox/fxt-2018.07.03/mset-perm-lex_8h_source.html

#include  <cstddef>
using ulong = std::size_t;

inline void  swap2(ulong& xa, ulong& xb)
{
    ulong  save_xb = xb;

    xb = xa;
    xa = save_xb;
}

class mset_perm_lex
 // Multiset permutations in lexicographic order, iterative algorithm.
 {
 public:
     ulong k_;    // number of different sorts of objects
     ulong *r_;   // number of elements '0' in r[0], '1' in r[1], ..., 'k-1' in r[k-1]
     ulong n_;    // number of objects
     ulong *ms_;  // multiset data in ms[0], ..., ms[n-1], sentinels at [-1] and [-2]

 private:  // have pointer data
     mset_perm_lex(const mset_perm_lex&);  // forbidden
     mset_perm_lex & operator = (const mset_perm_lex&);  // forbidden

 public:
     explicit mset_perm_lex(const ulong *r, ulong k)
     {
         k_ = k;
         r_ = new ulong[k];
         for (ulong j=0; j<k_; ++j)  r_[j] = r[j];  // get buckets

         n_ = 0;
         for (ulong j=0; j<k_; ++j)  n_ += r_[j];
         ms_ = new ulong[n_+2];
         ms_[0] = 0; ms_[1] = 1;  // sentinels:  ms[0] < ms[1]
         ms_ += 2;  // nota bene

         first();
     }

     void first()
     {
         for (ulong j=0, i=0;  j<k_;  ++j)
             for (ulong h=r_[j];  h!=0;  --h, ++i)
                 ms_[i] = j;
     }

     ~mset_perm_lex()
     {
         ms_ -= 2;
         delete [] ms_;
         delete [] r_;
     }

     const ulong * data()  const { return ms_; }

     ulong next()
     // Return position of leftmost change,
     // return n with last permutation.
     {
         // find rightmost pair with ms[i] < ms[i+1]:
         const ulong n1 = n_ - 1;
         ulong i = n1;
         do  { --i; }  while ( ms_[i] >= ms_[i+1] );  // can read sentinel
         if ( (long)i < 0 )  return n_;  // last sequence is falling seq.

         // find rightmost element ms[j] less than ms[i]:
         ulong j = n1;
         while ( ms_[i] >= ms_[j] )  { --j; }

         swap2(ms_[i], ms_[j]);

         // Here the elements ms[i+1], ..., ms[n-1] are a falling sequence.
         // Reverse order to the right:
         ulong r = n1;
         ulong s = i + 1;
         while ( r > s )  { swap2(ms_[r], ms_[s]);  --r;  ++s; }

         return i;
     } 
 };

// -------------- End of header-only FXT combinatorics code -----------

Second part of the class implementation file:

// Second part of file MSetPermGen.cpp: non-FXT code

#include  <cassert>
#include  <tuple>
#include  <map>
#include  <iostream>
#include  <cstdio>

#include  "MSetPermGen.h"

using  std::cout;
using  std::cerr;
using  std::endl;

class MSetPermGenImpl {  // wrapper class
public:
    MSetPermGenImpl(const SizeVec& freqs) : fg(freqs.data(), freqs.size())
    {}
private:
    mset_perm_lex   fg;

    friend class MSetPermGen;
};

static std::size_t  fact(size_t n)
{
    std::size_t  f = 1;

    for (std::size_t i = 1; i <= n; i++)
        f = f*i;
    return f;
}

MSetPermGen::MSetPermGen(const IntVec& vec) : items_(vec)
{
    std::map<int,int>  ma;

    for (int i: vec) {
        ma[i]++;
    }
    int item, freq;
    for (const auto& p : ma) {
       std::tie(item, freq) = p;
       itemValues_.push_back(item);
       freqs_.push_back(freq);
    }
    cycleLength_ = fact(items_.size());
    for (auto i: freqs_)
        cycleLength_ /= fact(i);

    // create FXT-level generator:
    genImpl_ = new MSetPermGenImpl(freqs_);
    for (std::size_t i=0; i < items_.size(); i++)
        state_.push_back(genImpl_->fg.ms_[i]);
}

std::size_t  MSetPermGen::getCycleLength() const
{
    return cycleLength_;
}

bool  MSetPermGen::forward(size_t incr)
{
    std::size_t  n  = items_.size();
    std::size_t  rc = 0;

    // move forward state by brute force, could be improved:
    for (std::size_t i=0; i < incr; i++) 
        rc = genImpl_->fg.next();

    for (std::size_t j=0; j < n; j++)
        state_[j] = genImpl_->fg.ms_[j];
    return (rc != n);
}

bool  MSetPermGen::next()
{
    return forward(1);
}

const SizeVec&  MSetPermGen::getPermIndices() const
{
    return (this->state_);
}

const IntVec&  MSetPermGen::getItems() const
{
    return (this->items_);
}

const IntVec&  MSetPermGen::getItemValues() const
{
    return (this->itemValues_);
}

Adapting the parallel application:

Regarding your multithreaded application, given that generating the "permutations" is cheap, you can afford to create one generator object per thread.

Before launching the actual computation, you forward each generator to its appropriate initial position, that is at step thread_id * (cycleLength / num_threads) .

I have tried to adapt your code to this MSetPermGen class along these lines. See code below.

With 3 threads, an input vector {1,1,1, 2,2,2, 3,3,3,3, 4,4,4,4,4} of size 15 (giving 12,612,600 permutations) and all diagnostics enabled, your modified parallel program runs in less than 10 seconds; less than 2 seconds with all diagnostics switched off.

Modified parallel program:

#include  <algorithm>
#include  <thread>
#include  <vector>
#include  <atomic>
#include  <mutex>
#include  <numeric>
#include  <set>
#include  <iostream>
#include  <fstream>
#include  <sstream>
#include  <cstdlib>

#include  "MSetPermGen.h"

using  std::cout;
using  std::endl;

// debug and instrumentation:
static std::atomic<size_t>  permCounter;
static bool doManagePermCounter = true;
static bool doThreadLogfiles    = true;
static bool doLogfileHeaders    = true;

template<class Container, class Func>
void parallel_for_each_permutation(const Container& container, int numThreads, Func mfunc) {

    MSetPermGen  gen0(container);
    std::size_t totalNumPermutations = gen0.getCycleLength();
    std::size_t permShare = totalNumPermutations / numThreads;
    if ((totalNumPermutations % numThreads) != 0)
        permShare++;
    std::cout << "totalNumPermutations: " << totalNumPermutations << std::endl;

    std::vector<std::thread>  threads;

    for (int threadId = 0; threadId < numThreads; threadId++) {
        threads.emplace_back([&, threadId]() {

            // generate some per-thread logfile name
            std::ostringstream  fnss;
            fnss << "thrlog_" << threadId << ".txt";
            std::string    fileName = fnss.str();
            std::ofstream  fh(fileName);

            MSetPermGen  thrGen(container);
            const std::size_t firstPerm = permShare * threadId;
            thrGen.forward(firstPerm);

            const std::size_t last_excl = std::min(totalNumPermutations,
                                             (threadId+1) * permShare);

            if (doLogfileHeaders) {
                fh << "MSG threadId: "  << threadId << '\n';
                fh << "MSG firstPerm: " << firstPerm << '\n';
                fh << "MSG lastExcl : " << last_excl << '\n';
            }

            Container permutation(container);            
            auto values      = thrGen.getItemValues();
            auto permIndices = thrGen.getPermIndices();
            auto nsz         = permIndices.size();

            std::size_t count = firstPerm;
            do {
                for (std::size_t i = 0; i < nsz; i++) {
                    permutation[i] = values[permIndices[i]];
                }

                mfunc(threadId, permutation);

                if (doThreadLogfiles) {
                    for (std::size_t i = 0; i < nsz; i++)
                        fh << permutation[i] << ' ';
                    fh << '\n';
                }
                thrGen.next();
                permIndices = thrGen.getPermIndices();
                ++count;
                if (doManagePermCounter) {
                    permCounter++;
                }
            } while (count < last_excl);

            fh.close();
        });
    }

    for(auto& thread : threads)
        thread.join();
}

template<class Container, class Func>
void parallel_for_each_unique_permutation(const Container& container, Func func) {
    constexpr int numThreads = 3;

    parallel_for_each_permutation(
        container,
        numThreads,
        [&](int threadId, const auto& permutation){
            // no longer need any mutual exclusion
            func(permutation);
        }
    );
}


int main()
{
    std::vector<int>  vector1{1,1,1,1,2,3,2,2,3,3,1};             // N=11
    std::vector<int>  vector0{1,1, 2,2,2};                        // N=5
    std::vector<int>  vector2{1,1,1, 2,2,2, 3,3,3,3, 4,4,4,4,4};  // N=15

    auto func = [](const auto& vec) { return; };

    permCounter.store(0);

    parallel_for_each_unique_permutation(vector2, func);

    auto finalPermCounter = permCounter.load();
    cout << "FinalPermCounter = " << finalPermCounter << endl;

}

Answer 2

Yes, the <algorithm> -header includes general algorithms for permutations, namely std::next_permutation() , std::prev_permutation() , and std::is_permutation() .

Admittedly, none of them give you the i-th permutation directly, you have to iterate them.

Unless you find a way to efficiently calculate the i-th permutation, consider using the same code for partitioning:
Separate the elements into the k with the rarest value, and the rest, and use that to partition the work.

Efficiently process each unique permutation of a vector when number of unique elements in vector is much smaller than vector size

Question

2 answers

solution1
1 ACCPTED 2019-08-17 23:03:44

Performance level:

An adaptor class for generating the permutations:

Example client program:

Text output from the above small program:

First part of implementation file:

Second part of the class implementation file:

Adapting the parallel application:

Modified parallel program:

solution2
0 2019-08-03 15:49:48

Efficiently process each unique permutation of a vector when number of unique elements in vector is much smaller than vector size

Question

2 answers

solution1 1 ACCPTED 2019-08-17 23:03:44

Performance level:

An adaptor class for generating the permutations:

Example client program:

Text output from the above small program:

First part of implementation file:

Second part of the class implementation file:

Adapting the parallel application:

Modified parallel program:

solution2 0 2019-08-03 15:49:48

solution1
1 ACCPTED 2019-08-17 23:03:44

solution2
0 2019-08-03 15:49:48