简体   繁体   中英

Checking if two strings are permutations of each other in Python

I'm checking if two strings a and b are permutations of each other, and I'm wondering what the ideal way to do this is in Python. From the Zen of Python, "There should be one -- and preferably only one -- obvious way to do it," but I see there are at least two ways:

sorted(a) == sorted(b)

and

all(a.count(char) == b.count(char) for char in a)

but the first one is slower when (for example) the first char of a is nowhere in b , and the second is slower when they are actually permutations.

Is there any better (either in the sense of more Pythonic, or in the sense of faster on average) way to do it? Or should I just choose from these two depending on which situation I expect to be most common?

Here is a way which is O(n), asymptotically better than the two ways you suggest.

import collections

def same_permutation(a, b):
    d = collections.defaultdict(int)
    for x in a:
        d[x] += 1
    for x in b:
        d[x] -= 1
    return not any(d.itervalues())

## same_permutation([1,2,3],[2,3,1])
#. True

## same_permutation([1,2,3],[2,3,1,1])
#. False

"but the first one is slower when (for example) the first char of a is nowhere in b".

This kind of degenerate-case performance analysis is not a good idea. It's a rat-hole of lost time thinking up all kinds of obscure special cases.

Only do the O -style "overall" analysis.

Overall, the sorts are O ( n log( n ) ).

The a.count(char) for char in a solution is O ( n 2 ). Each count pass is a full examination of the string.

If some obscure special case happens to be faster -- or slower, that's possibly interesting. But it only matters when you know the frequency of your obscure special cases. When analyzing sort algorithms, it's important to note that a fair number of sorts involve data that's already in the proper order (either by luck or by a clever design), so sort performance on pre-sorted data matters.

In your obscure special case ("the first char of a is nowhere in b") is this frequent enough to matter? If it's just a special case you thought of, set it aside. If it's a fact about your data, then consider it.

heuristically you're probably better to split them off based on string size.

Pseudocode:

returnvalue = false
if len(a) == len(b)
   if len(a) < threshold
      returnvalue = (sorted(a) == sorted(b))
   else
       returnvalue = naminsmethod(a, b)
return returnvalue

If performance is critical, and string size can be large or small then this is what I'd do.

It's pretty common to split things like this based on input size or type. Algorithms have different strengths or weaknesses and it would be foolish to use one where another would be better... In this case Namin's method is O(n), but has a larger constant factor than the O(n log n) sorted method.

I think the first one is the "obvious" way. It is shorter, clearer, and likely to be faster in many cases because Python's built-in sort is highly optimized.

Your second example won't actually work:

all(a.count(char) == b.count(char) for char in a)

will only work if b does not contain extra characters not in a. It also does duplicate work if the characters in string a repeat.

If you want to know whether two strings are permutations of the same unique characters, just do:

set(a) == set(b)

To correct your second example:

all(str1.count(char) == str2.count(char) for char in set(a) | set(b))

set() objects overload the bitwise OR operator so that it will evaluate to the union of both sets. This will make sure that you will loop over all the characters of both strings once for each character only.

That said, the sorted() method is much simpler and more intuitive, and would be what I would use.

Here are some timed executions on very small strings, using two different methods:
1. sorting
2. counting (specifically the original method by @namin).

a, b, c = 'confused', 'unfocused', 'foncused'

sort_method = lambda x,y: sorted(x) == sorted(y)

def count_method(a, b):
    d = {}
    for x in a:
        d[x] = d.get(x, 0) + 1
    for x in b:
        d[x] = d.get(x, 0) - 1
    for v in d.itervalues():
        if v != 0:
            return False
    return True

Average run times of the 2 methods over 100,000 loops are:

non-match (string a and b)

$ python -m timeit -s 'import temp' 'temp.sort_method(temp.a, temp.b)'
100000 loops, best of 3: 9.72 usec per loop
$ python -m timeit -s 'import temp' 'temp.count_method(temp.a, temp.b)'
10000 loops, best of 3: 28.1 usec per loop

match (string a and c)

$ python -m timeit -s 'import temp' 'temp.sort_method(temp.a, temp.c)'
100000 loops, best of 3: 9.47 usec per loop
$ python -m timeit -s 'import temp' 'temp.count_method(temp.a, temp.c)'
100000 loops, best of 3: 24.6 usec per loop

Keep in mind that the strings used are very small. The time complexity of the methods are different, so you'll see different results with very large strings. Choose according to your data, you may even use a combination of the two.

Sorry that my code is not in Python, I have never used it, but I am sure this can be easily translated into python. I believe this is faster than all the other examples already posted. It is also O(n), but stops as soon as possible:

public boolean isPermutation(String a, String b) {
    if (a.length() != b.length()) {
        return false;
    }

    int[] charCount = new int[256];
    for (int i = 0; i < a.length(); ++i) {
        ++charCount[a.charAt(i)];
    }

    for (int i = 0; i < b.length(); ++i) {
        if (--charCount[b.charAt(i)] < 0) {
            return false;
        }
    }
    return true;
}

First I don't use a dictionary but an array of size 256 for all the characters. Accessing the index should be much faster. Then when the second string is iterated, I immediately return false when the count gets below 0. When the second loop has finished, you can be sure that the strings are a permutation, because the strings have equal length and no character was used more often in b compared to a.

Here's martinus code in python. It only works for ascii strings:

def is_permutation(a, b):
    if len(a) != len(b):
        return False

    char_count = [0] * 256
    for c in a:
        char_count[ord(c)] += 1

    for c in b:
        char_count[ord(c)] -= 1
        if char_count[ord(c)] < 0:
            return False

    return True

I did a pretty thorough comparison in Java with all words in a book I had. The counting method beats the sorting method in every way. The results:

Testing against 9227 words.

Permutation testing by sorting ... done.        18.582 s
Permutation testing by counting ... done.       14.949 s

If anyone wants the algorithm and test data set, comment away.

First, for solving such problems, eg whether String 1 and String 2 are exactly the same or not, easily, you can use an "if" since it is O(1). Second, it is important to consider that whether they are only numerical values or they can be also words in the string. If the latter one is true (words and numerical values are in the string at the same time), your first solution will not work. You can enhance it by using "ord()" function to make it ASCII numerical value. However, in the end, you are using sort; therefore, in the worst case your time complexity will be O(NlogN). This time complexity is not bad. But, you can do better. You can make it O(N). My "suggestion" is using Array(list) and set at the same time. Note that finding a value in Array needs iteration so it's time complexity is O(N), but searching a value in set (which I guess it is implemented with HashTable in Python, I'm not sure) has O(1) time complexity:

def Permutation2(Str1, Str2):

ArrStr1 = list(Str1) #convert Str1 to array
SetStr2 = set(Str2) #convert Str2 to set

ArrExtra = []

if len(Str1) != len(Str2): #check their length
    return False

elif Str1 == Str2: #check their values
    return True

for x in xrange(len(ArrStr1)):
    ArrExtra.append(ArrStr1[x])

for x in xrange(len(ArrExtra)): #of course len(ArrExtra) == len(ArrStr1) ==len(ArrStr2)
    if ArrExtra[x] in SetStr2: #checking in set is O(1)
        continue
    else:
        return False

return True

This is derived from @patros' answer .

from collections import Counter

def is_anagram(a, b, threshold=1000000):
    """Returns true if one sequence is a permutation of the other.

    Ignores whitespace and character case.
    Compares sorted sequences if the length is below the threshold,
    otherwise compares dictionaries that contain the frequency of the
    elements.
    """
    a, b = a.strip().lower(), b.strip().lower()
    length_a, length_b = len(a), len(b)
    if length_a != length_b:
        return False
    if length_a < threshold:
        return sorted(a) == sorted(b)
    return Counter(a) == Counter(b)  # Or use @namin's method if you don't want to create two dictionaries and don't mind the extra typing.

In Python 3.1/2.7 you can just use collections.Counter(a) == collections.Counter(b) .

But sorted(a) == sorted(b) is still the most obvious IMHO. You are talking about permutations - changing order - so sorting is the obvious operation to erase that difference.

Go with the first one - it's much more straightforward and easier to understand. If you're actually dealing with incredibly large strings and performance is a real issue, then don't use Python, use something like C.

As far as the Zen of Python is concerned, that there should only be one obvious way to do things refers to small, simple things. Obviously for any sufficiently complicated task, there will always be zillions of small variations on ways to do it.

This is an O(n) solution in Python using hashing with dictionaries. Notice that I don't use default dictionaries because the code can stop this way if we determine the two strings are not permutations after checking the second letter for instance.

def if_two_words_are_permutations(s1, s2):
    if len(s1) != len(s2):
        return False
    dic = {}
for ch in s1:
    if ch in dic.keys():
        dic[ch] += 1
    else:
        dic[ch] = 1
for ch in s2:
    if not ch in dic.keys():
        return False
    elif dic[ch] == 0:
        return False
    else:
        dic[ch] -= 1
return True

In Swift (or another languages implementation), you could look at the encoded values ( in this case Unicode) and see if they match.

Something like:

let string1EncodedValues = "Hello".unicodeScalars.map() {
//each encoded value
$0
//Now add the values
}.reduce(0){ total, value in
    total + value.value
}

let string2EncodedValues = "oellH".unicodeScalars.map() {
    $0
    }.reduce(0) { total, value in
    total + value.value
}

let equalStrings = string1EncodedValues == string2EncodedValues ? true : false

You will need to handle spaces and cases as needed.

This is a PHP function I wrote about a week ago which checks if two words are anagrams. How would this compare (if implemented the same in python) to the other methods suggested? Comments?

public function is_anagram($word1, $word2) {
    $letters1 = str_split($word1);
    $letters2 = str_split($word2);
    if (count($letters1) == count($letters2)) {
        foreach ($letters1 as $letter) {
            $index = array_search($letter, $letters2);
            if ($index !== false) {
                unset($letters2[$index]);
            }
            else { return false; }
        }
        return true;
    }
    return false;        
}

Here's a literal translation to Python of the PHP version (by JFS):

def is_anagram(word1, word2):
    letters2 = list(word2)
    if len(word1) == len(word2):
       for letter in word1:
           try:
               del letters2[letters2.index(letter)]
           except ValueError:
               return False               
       return True
    return False

Comments:

1. The algorithm is O(N**2). Compare it to @namin's version (it is O(N)).
    2. The multiple returns in the function look horrible.

This version is faster than any examples presented so far except it is 20% slower than sorted(x) == sorted(y) for short strings. It depends on use cases but generally 20% performance gain is insufficient to justify a complication of the code by using different version for short and long strings (as in @patros's answer).

It doesn't use len so it accepts any iterable therefore it works even for data that do not fit in memory eg, given two big text files with many repeated lines it answers whether the files have the same lines (lines can be in any order).

def isanagram(iterable1, iterable2):
    d = {}
    get = d.get
    for c in iterable1:
        d[c] = get(c, 0) + 1
    try:
        for c in iterable2:
            d[c] -= 1
        return not any(d.itervalues())
    except KeyError:
        return False

It is unclear why this version is faster then defaultdict (@namin's) one for large iterable1 (tested on 25MB thesaurus).

If we replace get in the loop by try: ... except KeyError then it performs 2 times slower for short strings ie when there are few duplicates.

def matchPermutation(s1, s2):
  a = []
  b = []

  if len(s1) != len(s2):
    print 'length should be the same'
  return

  for i in range(len(s1)):
    a.append(s1[i])

  for i in range(len(s2)):
    b.append(s2[i])

  if set(a) == set(b):
    print 'Permutation of each other'
  else:
    print 'Not a permutation of each other'
  return

#matchPermutaion('rav', 'var') #returns True
matchPermutaion('rav', 'abc') #returns False

Checking if two strings are permutations of each other in Python

# First method
def permutation(s1,s2):
 if len(s1) != len(s2):return False;
 return ' '.join(sorted(s1)) == ' '.join(sorted(s2))


# second method
def permutation1(s1,s2):
 if len(s1) != len(s2):return False;
 array = [0]*128;
 for c in s1:
 array[ord(c)] +=1
 for c in s2:
   array[ord(c)] -=1
   if (array[ord(c)]) < 0:
     return False
 return True

How about something like this. Pretty straight-forward and readable. This is for strings since the as per the OP.

Given that the complexity of sorted() is O(n log n).

def checkPermutation(a,b):
    # input: strings a and b
    # return: boolean true if a is Permutation of b

    if len(a) != len(b):
        return False
    else:
        s_a = ''.join(sorted(a))
        s_b = ''.join(sorted(b))
        if s_a == s_b:
            return True
        else:
            return False

# test inputs
a = 'sRF7w0qbGp4fdgEyNlscUFyouETaPHAiQ2WIxzohiafEGJLw03N8ALvqMw6reLN1kHRjDeDausQBEuIWkIBfqUtsaZcPGoqAIkLlugTxjxLhkRvq5d6i55l4oBH1QoaMXHIZC5nA0K5KPBD9uIwa789sP0ZKV4X6'
b = 'Vq3EeiLGfsAOH2PW6skMN8mEmUAtUKRDIY1kow9t1vIEhe81318wSMICGwf7Rv2qrLrpbeh8bh4hlRLZXDSMyZJYWfejLND4u9EhnNI51DXcQKrceKl9arWqOl7sWIw3EBkeu7Fw4TmyfYwPqCf6oUR0UIdsAVNwbyyiajtQHKh2EKLM1KlY6NdvQTTA7JKn6bLInhFvwZ4yKKbzkgGhF3Oogtnmzl29fW6Q2p0GPuFoueZ74aqlveGTYc0zcXUJkMzltzohoRdMUKP4r5XhbsGBED8ReDbL3ouPhsFchERvvNuaIWLUCY4gl8OW06SMuvceZrCg7EkSFxxprYurHz7VQ2muxzQHj7RG2k3khxbz2ZAhWIlBBtPtg4oXIQ7cbcwgmBXaTXSBgBe3Y8ywYBjinjEjRJjVAiZkWoPrt8JtZv249XiN0MTVYj0ZW6zmcvjZtRn32U3KLMOdjLnRFUP2I3HJtp99uVlM9ghIpae0EfC0v2g78LkZE1YAKsuqCiiy7DVOhyAZUbOrRwXOEDHxUyXwCmo1zfVkPVhwysx8HhH7Iy0yHAMr0Tb97BqcpmmyBsrSgsV1aT3sjY0ctDLibrxbRXBAOexncqB4BBKWJoWkQwUZkFfPXemZkWYmE72w5CFlI6kuwBQp27dCDZ39kRG7Txs1MbsUnRnNHBy1hSOZvTQRYZPX0VmU8SVGUqzwm1ECBHZakQK4RUquk3txKCqbDfbrNmnsEcjFaiMFWkY3Esg6p3Mm41KWysTpzN6287iXjgGSBw6CBv0hH635WiZ0u47IpUD5mY9rkraDDl5sDgd3f586EWJdKAaou3jR7eYU7YuJT3RQVRI0MuS0ec0xYID3WTUI0ckImz2ck7lrtfnkewzRMZSE2ANBkEmg2XAmwrCv0gy4ExW5DayGRXoqUv06ZLGCcBEiaF0fRMlinhElZTVrGPqqhT03WSq4P97JbXA90zUxiHCnsPjuRTthYl7ZaiVZwNt3RtYT4Ff1VQ5KXRwRzdzkRMsubBX7YEhhtl0ZGVlYiP4N4t00Jr7fB4687eabUqK6jcUVpXEpTvKDbj0JLcLYsneM9fsievUz193f6aMQ5o5fm4Ilx3TUZiX4AUsoyd8CD2SK3NkiLuR255BDIA0Zbgnj2XLyQPiJ1T4fjStpjxKOTzsQsZxpThY9Fvjvoxcs3HAiXjLtZ0TSOX6n4ZLjV3TdJMc4PonwqIb3lAndlTMnuzEPof2dXnpexoVm5c37XQ7fBkoMBJ4ydnW25XKYJbkrueRDSwtJGHjY37dob4jPg0axM5uWbqGocXQ4DyiVm5GhvuYX32RQaOtXXXw8cWK6JcSUnlP1gGLMNZEGeDXOuGWiy4AJ7SH93ZQ4iPgoxdfCuW0qbsLKT2HopcY9dtBIRzr91wnES9lDL49tpuW77LSt5dGA0YLSeWAaZt9bDrduE0gDZQ2yX4SDvAOn4PMcbFRfTqzdZXONmO7ruBHHb1tVFlBFNc4xkoetDO2s7mpiVG6YR4EYMFIG1hBPh7Evhttb34AQzqImSQm1gyL3O7n3p98Kqb9qqIPbN1kuhtW5mIbIioWW2n7MHY7E5mt0'

print(checkPermutation(a, b)) #optional
def permute(str1,str2):
    if sorted(str1) == sorted(str2):
        return True
    else:
        return False

str1="hello"
str2='olehl'
a=permute(str1,str2)
print(a
from collections import defaultdict
def permutation(s1,s2):
    h = defaultdict(int)
    for ch in s1:
        h[ch]+=1
    for ch in s2:
        h[ch]-=1
    for key in h.keys():
        if h[key]!=0 or len(s1)!= len(s2):
            return False
        return True
print(permutation("tictac","tactic"))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM