
Resolving equal XOR values for different strings for anagram detection

I recently had an interview question where I had to write a function that takes two strings and returns 1 if they are anagrams of each other, or 0 otherwise. To simplify things, both strings are of the same length, non-empty, and contain only lower-case alphabetical and numeric characters.

I implemented a function that accumulates the XOR value of the characters of each string independently, then compares the final XOR values of the two strings. If they are equal, I return 1, else I return 0.

My function:

int isAnagram(char* str1, char* str2){
    int xor_acc_1 = 0;
    int xor_acc_2 = 0;
    for(int i = 0; i<strlen(str1); i++){
        xor_acc_1 ^= str1[i] - '0';
        xor_acc_2 ^= str2[i] - '0';
    }
    return xor_acc_1 == xor_acc_2;
}

My function worked for every case except for one test case.

char* str1 = "123";
char* str2 = "303";

To my surprise, even though these two strings are not anagrams of each other, they both produced the same XOR value (48 when the raw character codes are XOR-ed, or 0 with the - '0' offset used in the function above).

My question is: can this still be resolved with XOR in linear time, without using a data structure (e.g. a Map), by modifying the mathematics behind XOR?

A pure xor solution will not work as there is information lost during the process (this problem is likely to exist in other forms of lossy calculation as well, such as hashing). The information lost in this case is the actual characters being used for comparison.

By way of example, consider the two strings ae and bf (in ASCII):

  a: 0110 0001    b: 0110 0010
  e: 0110 0101    f: 0110 0110
     ---- ----       ---- ----
xor: 0000 0100       0000 0100

You can see that the result of the xor is identical for both strings despite the fact they are totally different.

This may become even more obvious once you realise that any value xor-ed with itself is zero, meaning that all strings like aa, bb, cc, xx, and so on, would be considered anagrams of each other under your scheme.
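To make the collisions concrete, here is a minimal sketch of the XOR accumulation on its own. The helper name xor_fold is hypothetical (it is not from the question), and the values quoted afterwards assume an ASCII character set.

```c
#include <stddef.h>

/* Hypothetical helper: XOR together the character codes of a string.
 * Each character is converted to unsigned char so the accumulated
 * value does not depend on whether plain char is signed. */
static unsigned xor_fold(const char *s) {
    unsigned acc = 0;
    for (size_t i = 0; s[i] != '\0'; ++i)
        acc ^= (unsigned char)s[i];
    return acc;
}
```

Under ASCII, xor_fold("ae") and xor_fold("bf") both give 4, xor_fold("aa") and xor_fold("bb") both give 0, and xor_fold("123") and xor_fold("303") both give 48, so none of these pairs can be told apart by the XOR value alone.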

So, now you've established that method as unsuitable, there are a couple of options that spring to mind.


The first is to simply sort both strings and compare them. Once sorted, they will be identical on a character-by-character basis. This will work but it's unlikely to deliver your requested O(n) time complexity since you'll almost certainly be using a comparison style sort.
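A rough sketch of the sort-and-compare option, assuming the standard qsort is acceptable. The names cmp_char and is_anagram_sorted are made up for illustration, and the inputs are copied first so the caller's strings are left unmodified.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Comparison callback for qsort: orders bytes as unsigned char values. */
static int cmp_char(const void *a, const void *b) {
    unsigned char x = *(const unsigned char *)a;
    unsigned char y = *(const unsigned char *)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper: sort copies of both strings, then compare them.
 * Returns false if memory for the copies cannot be allocated. */
static bool is_anagram_sorted(const char *s1, const char *s2) {
    size_t len = strlen(s1);
    if (len != strlen(s2))
        return false;

    char *a = malloc(len + 1);
    char *b = malloc(len + 1);
    bool result = false;
    if (a != NULL && b != NULL) {
        memcpy(a, s1, len + 1);
        memcpy(b, s2, len + 1);
        qsort(a, len, 1, cmp_char);
        qsort(b, len, 1, cmp_char);
        result = memcmp(a, b, len) == 0;  /* Same sorted bytes means anagram. */
    }
    free(a);
    free(b);
    return result;
}
```

With this sketch, is_anagram_sorted("123", "303") is false because the sorted copies are 123 and 033.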


The second still allows you to meet that requirement by using the usual "trick" of trading space for time. You simply set up a count of each character (all initially zero) then, for each character in the first string, increase its count.

After that, for each character in the second string, decrease its count.

That's linear time complexity and the strings can be deemed to be anagrams if every character count is set to zero after the process. Any non-zero count will only be there if a character occurred more times in one string than the other.

This is effectively a counting sort, a non-comparison sort, meaning it's not subject to the normal minimum O(n log n) time complexity for those sorts.

The pseudo-code for such a beast would be:

def isAnagram(str1, str2):
    if len(str1) != len(str2):    # Can also handle different lengths.
        return false

    dim count[0..255] = {0}       # Init all counts to zero.

    for each code in str1:        # Increase for each char in string 1.
        count[code]++

    for each code in str2:        # Decrease for each char in string 2.
        count[code]--

    for each code in 0..255:
        if count[code] != 0:      # Any non-zero means non-anagram.
            return false    

    return true                   # All zero means anagram.

Here, by the way, is a complete C test program which illustrates this concept, able to handle 8-bit character widths though more widths can be added with a simple change to the #if section:

#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <stdbool.h>

#if CHAR_BIT == 8
    #define ARRSZ 256
#else
    #error Need to adjust for unexpected CHAR_BIT.
#endif

static bool isAnagram(const char *str1, const char *str2) {
    // Ensure strings are same size.

    size_t len = strlen(str1);
    if (len != strlen(str2))
        return false;

    // Initialise all counts to zero.

    int count[ARRSZ];
    for (size_t i = 0; i < sizeof(count) / sizeof(*count); ++i)
        count[i] = 0;

    // Increment for string 1, decrement for string 2.
    // Cast to unsigned char so the index is never negative.

    for (size_t i = 0; i < len; ++i) {
        count[(unsigned char)str1[i]]++;
        count[(unsigned char)str2[i]]--;
    }

    // Any count non-zero means non-anagram.

    for (size_t i = 0; i < sizeof(count) / sizeof(*count); ++i)
        if (count[i] != 0)
            return false;

    // All counts zero means anagram.

    return true;
}

int main(int argc, char *argv[]) {
    if ((argc - 1) % 2 != 0) {
        puts("Usage: check_anagrams [<string1> <string2>] ...");
        return 1;
    }

    for (int i = 1; i < argc; i += 2) {
        printf("%s: '%s' '%s'\n",
            isAnagram(argv[i], argv[i + 1]) ? "Yes" : " No",
            argv[i], argv[i + 1]);
    }

    return 0;
}

Running this on some suitable test data shows it in action:

pax$ ./check_anagrams ' paxdiablo ' 'a plaid box' paxdiablo PaxDiablo \
         one two aa bb aa aa '' '' paxdiablo pax.diablo

Yes: ' paxdiablo ' 'a plaid box'
 No: 'paxdiablo' 'PaxDiablo'
 No: 'one' 'two'
 No: 'aa' 'bb'
Yes: 'aa' 'aa'
Yes: '' ''
 No: 'paxdiablo' 'pax.diablo'

Why do you need to use XOR in the first place?

The simplest approach that is still fast enough is to sort both strings by character and compare the results for equality. If you need a faster sorting algorithm, you can use a counting sort to achieve linear time.

Another way is to count the occurrences of each character in each string and check whether those counts are equal.
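As an illustration of the counting-sort idea, here is a minimal sketch that sorts a character buffer in place in linear time. The function name counting_sort_chars is an assumption for this example, and it assumes 8-bit characters, so 256 possible codes.

```c
#include <stddef.h>

/* Hypothetical helper: sort the bytes of buf[0..len) in place.
 * Assumes 8-bit characters, hence 256 possible codes. */
static void counting_sort_chars(char *buf, size_t len) {
    unsigned char *p = (unsigned char *)buf;
    size_t count[256] = {0};

    /* First pass: count how often each code occurs. */
    for (size_t i = 0; i < len; ++i)
        count[p[i]]++;

    /* Second pass: write each code back as many times as it occurred. */
    size_t out = 0;
    for (size_t code = 0; code < 256; ++code)
        for (size_t n = 0; n < count[code]; ++n)
            p[out++] = (unsigned char)code;
}
```

Sorting writable copies of both strings this way and then comparing them with strcmp gives the anagram check. For example, both silent and listen sort to eilnst.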
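One possible reading of this suggestion, sketched with a separate count table per string. The names count_chars and same_counts are hypothetical, and the tables are sized from UCHAR_MAX so every possible character value has a slot.

```c
#include <limits.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: fill count[] with the number of occurrences
 * of each character value in s. */
static void count_chars(const char *s, size_t count[UCHAR_MAX + 1]) {
    memset(count, 0, (UCHAR_MAX + 1) * sizeof *count);
    for (; *s != '\0'; ++s)
        count[(unsigned char)*s]++;
}

/* Two strings are anagrams exactly when their count tables match. */
static bool same_counts(const char *s1, const char *s2) {
    size_t c1[UCHAR_MAX + 1], c2[UCHAR_MAX + 1];
    count_chars(s1, c1);
    count_chars(s2, c2);
    return memcmp(c1, c2, sizeof c1) == 0;
}
```

Here same_counts("123", "321") is true, while same_counts("123", "303") is false. Strings of different lengths are rejected without a separate length check, because their tables cannot match.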

EDIT

Your XOR-based solution is not correct. More than one combination of characters can XOR to the same number, so the character codes of two different strings will not always yield different XOR values. For two strings that really are anagrams, the output will always be correct. For strings that are not, the output may be wrong (a false positive).

