简体   繁体   中英

number to unique permutation mapping of a sequence containing duplicates

I am looking for an algorithm that can map a number to a unique permutation of a sequence. I have found out about Lehmer codes and the factorial number system thanks to a similar question, Fast permutation -> number -> permutation mapping algorithms , but that question doesn't deal with the case where there are duplicate elements in the sequence.

For example, take the sequence 'AAABBC'. There are 6! = 720 ways that could be arranged, but I believe there are only 6! / (3! * 2! * 1!) = 60 unique permutation of this sequence. How can I map a number to a permutation in these cases?

Edit: changed the term 'set' to 'sequence'.

From Permutation to Number:

Let K be the number of character classes (example: AAABBC has three character classes)

Let N[K] be the number of elements in each character class. (example: for AAABBC, we have N[K]=[3,2,1], and let N= sum(N[K])

Every legal permutation of the sequence then uniquely corresponds to a path in an incomplete K-way tree.

The unique number of the permutation then corresponds to the index of the tree-node in a post-order traversal of the K-ary tree terminal nodes .

Luckily, we don't actually have to perform the tree traversal -- we just need to know how many terminal nodes in the tree are lexicographically less than our node. This is very easy to compute, as at any node in the tree, the number terminal nodes below the current node is equal to the number of permutations using the unused elements in the sequence, which has a closed form solution that is a simple multiplication of factorials.

So given our 6 original letters, and the first element of our permutation is a 'B', we determine that there will be 5!/3!1!1! = 20 elements that started with 'A', so our permutation number has to be greater than 20. Had our first letter been a 'C', we could have calculated it as 5!/2!2!1! (not A) + 5!/3!1!1! (not B) = 30+ 20, or alternatively as 60 (total) - 5!/3!2!0! (C) = 50

Using this, we can take a permutation (eg 'BAABCA') and perform the following computations: Permuation #= (5!/2!2!1!) ('B') + 0('A') + 0('A')+ 3!/1!1!1! ('B') + 2!/1!

= 30 + 3 +2 = 35

Checking that this works: CBBAAA corresponds to

(5!/2!2!1! (not A) + 5!/3!1!1! (not B)) 'C'+ 4!/2!2!0! (not A) 'B' + 3!/2!1!0! (not A) 'B' = (30 + 20) +6 + 3 = 59

Likewise, AAABBC = 0 ('A') + 0 'A' + '0' A' + 0 'B' + 0 'B' + 0 'C = 0

Sample implementation:

import math
import copy
from operator import mul

def computePermutationNumber(inPerm, inCharClasses):
    permutation=copy.copy(inPerm)
    charClasses=copy.copy(inCharClasses)

    n=len(permutation)
    permNumber=0
    for i,x in enumerate(permutation):
        for j in xrange(x):
            if( charClasses[j]>0):
                charClasses[j]-=1
                permNumber+=multiFactorial(n-i-1, charClasses)
                charClasses[j]+=1
        if charClasses[x]>0:
            charClasses[x]-=1
    return permNumber

def multiFactorial(n, charClasses):
    val= math.factorial(n)/ reduce(mul, (map(lambda x: math.factorial(x), charClasses)))
    return val

From Number to Permutation: This process can be done in reverse, though I'm not sure how efficiently: Given a permutation number, and the alphabet that it was generated from, recursively subtract the largest number of nodes less than or equal to the remaining permutation number.

Eg Given a permutation number of 59, we first can subtract 30 + 20 = 50 ('C') leaving 9. Then we can subtract 'B' (6) and a second 'B'(3), re-generating our original permutation.

Here is an algorithm in Java that enumerates the possible sequences by mapping an integer to the sequence.

public class Main {

    private int[] counts = { 3, 2, 1 }; // 3 Symbols A, 2 Symbols B, 1 Symbol C
    private int n = sum(counts);

    public static void main(String[] args) {
        new Main().enumerate();
    }

    private void enumerate() {
        int s = size(counts);
        for (int i = 0; i < s; ++i) {
            String p = perm(i);
            System.out.printf("%4d -> %s\n", i, p);
        }

    }

    // calculates the total number of symbols still to be placed
    private int sum(int[] counts) {
        int n = 0;
        for (int i = 0; i < counts.length; i++) {
            n += counts[i];
        }
        return n;
    }

    // calculates the number of different sequences with the symbol configuration in counts
    private int size(int[] counts) {
        int res = 1;
        int num = 0;
        for (int pos = 0; pos < counts.length; pos++) {
            for (int den = 1; den <= counts[pos]; den++) {
                res *= ++num;
                res /= den;
            }
        }
        return res;
    }

    // maps the sequence number to a sequence
    private String perm(int num) {
        int[] counts = this.counts.clone();
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; ++i) {
            int p = 0;
            for (;;) {
                while (counts[p] == 0) {
                    p++;
                }
                counts[p]--;
                int c = size(counts);
                if (c > num) {
                    sb.append((char) ('A' + p));
                    break;
                }
                counts[p]++;
                num -= c;
                p++;
            }
        }
        return sb.toString();
    }

}

The mapping used by the algorithm is as follows. I use the example given in the question (3 x A, 2 x B, 1 x C) to illustrate it.

There are 60 (=6!/3!/2!/1!) possible sequences in total, 30 (=5!/2!/2!/1!) of them have an A at the first place, 20 (=5!/3!/1!/1!) have a B at the first place, and 10 (=5!/3!/2!/0!) have a C at the first place.

The numbers 0..29 are mapped to all sequences starting with an A , 30..49 are mapped to the sequences starting with B , and 50..59 are mapped to the sequences starting with C .

The same process is repeated for the next place in the sequence, for example if we take the sequences starting with B we have now to map numbers 0 (=30-30) .. 19 (=49-30) to the sequences with configuration (3 x A, 1 x B, 1 x C)

A very simple algorithm to mapping a number for a permutation consists of n digits is

number<-digit[0]*10^(n-1)+digit[1]*10^(n-2)+...+digit[n]*10^0

You can find plenty of resources for algorithms to generate permutations. I guess you want to use this algorithm in bioinformatics. For example you can use itertools.permutations from Python.

Assuming the resulting number fits inside a word (eg 32 or 64 bit integer) relatively easily, then much of the linked article still applies. Encoding and decoding from a variable base remains the same. What changes is how the base varies.

If you're creating a permutation of a sequence, you pick an item out of your bucket of symbols (from the original sequence) and put it at the start. Then you pick out another item from your bucket of symbols and put it on the end of that. You'll keep picking and placing symbols at the end until you've run out of symbols in your bucket.

What's significant is which item you picked out of the bucket of the remaining symbols each time. The number of remaining symbols is something you don't have to record because you can compute that as you build the permutation -- that's a result of your choices, not the choices themselves.

The strategy here is to record what you chose, and then present an array of what's left to be chosen. Then choose, record which index you chose (packing it via the variable base method), and repeat until there's nothing left to choose. (Just as above when you were building a permuted sequence.)

In the case of duplicate symbols it doesn't matter which one you picked, so you can treat them as the same symbol. The difference is that when you pick a symbol which still has a duplicate left, you didn't reduce the number of symbols in the bucket to pick from next time.

Let's adopt a notation that makes this clear:

Instead of listing duplicate symbols left in our bucket to choose from like cabcaa we'll list them along with how many are still in the bucket: c-2 a-3 b-1 .

Note that if you pick c from the list, the bucket has c-1 a-3 b-1 left in it. That means next time we pick something, we have three choices.

But on the other hand, if I picked b from the list, the bucket has c-2 a-3 left in it. That means next time we pick something, we only have two choices.

When reconstructing the permuted sequence we just maintain the bucket the same way as when we were computing the permutation number.

The implementation details aren't trivial, but they're straightforward with standard algorithms. The only thing that might heckle you is what to do when a symbol in your bucket is no longer available.

Suppose your bucket was represented by a list of pairs (like above): c-1 a-3 b-1 and you choose c . Your resulting bucket is c-0 a-3 b-1 . But c-0 is no longer a choice, so your list should only have two entries, not three. You could move the entire list down by 1 resulting in a-3 b-1 , but if your list is long this is expensive. A fast an easy solution: move the last element of the bucket into the removed location and decrease your bucket size: c0 a-3 b-1 becomes b-1 a-3 <empty> or just b-1 a-3 .

Note that we can do the above because it doesn't matter what order the symbols in the bucket are listed in, as long as it's the same way when we encode or decode the number.

As I was unsure of the code in g bronner's answer (or of my understanding), I recoded it in R as follows

ritpermz=function(n, parclass){
    return(factorial(n) / prod(factorial(parclass)))}

rankum <- function(confg, parclass){
    n=length(confg)
    permdex=1
    for (i in 1:(n-1)){
      x=confg[i]
      if (x > 1){
        for (j in 1:(x-1)){
            if(parclass[j] > 0){
                parclass[j]=parclass[j]-1
                permdex=permdex + ritpermz(n-i, parclass)
                parclass[j]=parclass[j]+1}}}
        parclass[x]=parclass[x]-1
    }#}
    return(permdex)
}

which does produce a ranking with the right range of integers

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM