
Resolving equal XOR values for different strings for anagram detection

I recently had an interview question where I had to write a function that takes two strings and returns 1 if they are anagrams of each other, or 0 otherwise. To simplify things, both strings are the same length, non-empty, and contain only lower-case alphabetic and numeric characters.

I implemented a function that accumulates the XOR of the characters of each string independently, then compares the two final XOR values. If they are equal, I return 1; otherwise I return 0.

My function:

int isAnagram(char* str1, char* str2){
    int xor_acc_1 = 0;
    int xor_acc_2 = 0;
    for(int i = 0; i<strlen(str1); i++){
        xor_acc_1 ^= str1[i] - '0';
        xor_acc_2 ^= str2[i] - '0';
    }
    return xor_acc_1 == xor_acc_2;
}

My function worked for every test case except one:

char* str1 = "123";
char* str2 = "303";

To my surprise, even though these two strings are not anagrams of each other, they produce the same XOR value: 48 when the raw character codes are XOR-ed, or 0 with the - '0' offset used in the code above.

My question is: can this still be solved with XOR in linear time, without using a data structure such as a map, by modifying the mathematics behind the XOR?

A pure XOR solution will not work because information is lost during the process (the same problem is likely to exist in other forms of lossy calculation, such as hashing). The information lost in this case is the actual characters being compared.

By way of example, consider the two strings "ae" and "bf" (in ASCII):

  a: 0110 0001    b: 0110 0010
  e: 0110 0101    f: 0110 0110
     ---- ----       ---- ----
xor: 0000 0100       0000 0100

You can see that the result of the XOR is identical for both strings despite the fact that they are totally different.

This may become even more obvious once you realise that any value XOR-ed with itself is zero, meaning that all strings like "aa", "bb", "cc", "xx", and so on, would be considered anagrams of each other under your scheme.
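The collisions are easy to reproduce. The following is a minimal sketch, assuming a helper named xor_fold (a name introduced here for illustration, not taken from the question) that XORs the raw character codes of a string:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: folds every byte of s into one value with XOR. */
unsigned xor_fold(const char *s) {
    assert(s != NULL);
    unsigned acc = 0;
    for (size_t i = 0; s[i] != '\0'; ++i)
        acc ^= (unsigned char)s[i];   /* order of characters is irrelevant */
    return acc;
}
```

With this helper, xor_fold("ae") and xor_fold("bf") both give 4, xor_fold("aa") and xor_fold("bb") both give 0, and xor_fold("123") and xor_fold("303") both give 48, so none of these pairs can be told apart by the XOR value alone.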

So, now that you've established that method as unsuitable, a couple of options spring to mind.


The first is simply to sort both strings and compare them. Once sorted, anagrams will be identical character by character. This works, but it is unlikely to deliver the O(n) time complexity you asked for, since you will almost certainly be using a comparison-style sort.
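A minimal sketch of the sort-and-compare option, assuming the standard qsort and two hypothetical names (cmp_byte and is_anagram_sorted) chosen for illustration. It works on copies so the inputs are left untouched, and it treats an allocation failure as "not an anagram", which is a simplification:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: orders bytes by their unsigned value. */
static int cmp_byte(const void *a, const void *b) {
    unsigned char x = *(const unsigned char *)a;
    unsigned char y = *(const unsigned char *)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper: sorts copies of both strings, then compares them. */
bool is_anagram_sorted(const char *s1, const char *s2) {
    assert(s1 != NULL && s2 != NULL);
    size_t len = strlen(s1);
    if (len != strlen(s2))
        return false;

    char *c1 = malloc(len + 1);
    char *c2 = malloc(len + 1);
    bool same = false;                 /* also the result if malloc fails */
    if (c1 != NULL && c2 != NULL) {
        memcpy(c1, s1, len + 1);
        memcpy(c2, s2, len + 1);
        qsort(c1, len, 1, cmp_byte);   /* comparison sort */
        qsort(c2, len, 1, cmp_byte);
        same = memcmp(c1, c2, len) == 0;
    }
    free(c1);
    free(c2);
    return same;
}
```

The C standard does not specify the complexity of qsort, so this variant should not be relied on for linear time.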


The second still lets you meet that requirement by using the usual "trick" of trading space for time. You simply set up a count for each possible character (all initially zero) and then, for each character in the first string, increase its count.

After that, for each character in the second string, decrease its count.

That's linear time complexity, and the strings can be deemed anagrams if every character count is zero after the process. A non-zero count can only remain if a character occurred more times in one string than in the other.

This is effectively a counting sort, a non-comparison sort, so it is not subject to the usual O(n log n) lower bound that applies to comparison sorts.

The pseudo-code for such a beast would be:

def isAnagram(str1, str2):
    if len(str1) != len(str2):    # Can also handle different lengths.
        return false

    dim count[0..255] = {0}       # Init all counts to zero.

    for each code in str1:        # Increase for each char in string 1.
        count[code]++

    for each code in str2:        # Decrease for each char in string 2.
        count[code]--

    for each code in 0..255:
        if count[code] != 0:      # Any non-zero means non-anagram.
            return false    

    return true                   # All zero means anagram.

Here, by the way, is a complete C test program that illustrates this concept. It handles 8-bit characters, though other widths can be supported with a simple change to the #if section:

#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <stdbool.h>

#if CHAR_BIT == 8
    #define ARRSZ 256
#else
    #error Need to adjust for unexpected CHAR_BIT.
#endif

static bool isAnagram(const char *str1, const char *str2) {
    // Ensure strings are same size.

    size_t len = strlen(str1);
    if (len != strlen(str2))
        return false;

    // Initialise all counts to zero.

    int count[ARRSZ];
    for (size_t i = 0; i < sizeof(count) / sizeof(*count); ++i)
        count[i] = 0;

    // Increment for string 1, decrement for string 2.
    // Index via unsigned char so codes above 127 cannot go negative.

    for (size_t i = 0; i < len; ++i) {
        count[(unsigned char)str1[i]]++;
        count[(unsigned char)str2[i]]--;
    }

    // Any count non-zero means non-anagram.

    for (size_t i = 0; i < sizeof(count) / sizeof(*count); ++i)
        if (count[i] != 0)
            return false;

    // All counts zero means anagram.

    return true;
}

int main(int argc, char *argv[]) {
    if ((argc - 1) % 2 != 0) {
        puts("Usage: check_anagrams [<string1> <string2>] ...");
        return 1;
    }

    for (int i = 1; i < argc; i += 2) {
        printf("%s: '%s' '%s'\n",
            isAnagram(argv[i], argv[i + 1]) ? "Yes" : " No",
            argv[i], argv[i + 1]);
    }

    return 0;
}

Running this on some suitable test data shows it in action:

pax$ ./check_anagrams ' paxdiablo ' 'a plaid box' paxdiablo PaxDiablo \
         one two aa bb aa aa '' '' paxdiablo pax.diablo

Yes: ' paxdiablo ' 'a plaid box'
 No: 'paxdiablo' 'PaxDiablo'
 No: 'one' 'two'
 No: 'aa' 'bb'
Yes: 'aa' 'aa'
Yes: '' ''
 No: 'paxdiablo' 'pax.diablo'

Why do you need to use XOR in the first place?

The simplest approach, and one that is fast enough, is to sort both strings by character and check whether the results are equal. If you need a faster sort, you can use counting sort to achieve linear time.
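As a rough sketch of the counting-sort variant, assuming 8-bit style byte strings and two hypothetical helper names (counting_sort_bytes and is_anagram_counting_sort) that do not appear in the answer:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: counting sort of len bytes from src into dst.
   dst must have room for len bytes. */
void counting_sort_bytes(const char *src, size_t len, char *dst) {
    size_t count[UCHAR_MAX + 1] = {0};
    for (size_t i = 0; i < len; ++i)
        count[(unsigned char)src[i]]++;

    /* Emit each byte value as many times as it was seen, in order. */
    size_t out = 0;
    for (int code = 0; code <= UCHAR_MAX; ++code)
        for (size_t n = count[code]; n > 0; --n)
            dst[out++] = (char)code;
}

/* Hypothetical helper: sorts both strings in linear time and compares. */
bool is_anagram_counting_sort(const char *s1, const char *s2) {
    assert(s1 != NULL && s2 != NULL);
    size_t len = strlen(s1);
    if (len != strlen(s2))
        return false;
    if (len == 0)
        return true;

    char *a = malloc(len);
    char *b = malloc(len);
    bool same = false;                 /* also the result if malloc fails */
    if (a != NULL && b != NULL) {
        counting_sort_bytes(s1, len, a);
        counting_sort_bytes(s2, len, b);
        same = memcmp(a, b, len) == 0;
    }
    free(a);
    free(b);
    return same;
}
```

Each string is scanned once and the count table has a fixed size, so the work grows linearly with the string length.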

Another way is to count how many times each character occurs in each string and check whether the counts are equal.
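One possible sketch of that idea keeps a separate table of counts per string and compares the tables afterwards. The function name is_anagram_counts is an assumption made for this example:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: counts each byte value per string, then compares. */
bool is_anagram_counts(const char *s1, const char *s2) {
    assert(s1 != NULL && s2 != NULL);
    size_t count1[UCHAR_MAX + 1] = {0};
    size_t count2[UCHAR_MAX + 1] = {0};

    for (; *s1 != '\0'; ++s1)
        count1[(unsigned char)*s1]++;
    for (; *s2 != '\0'; ++s2)
        count2[(unsigned char)*s2]++;

    /* Anagrams have identical counts for every possible byte value. */
    for (size_t code = 0; code <= UCHAR_MAX; ++code)
        if (count1[code] != count2[code])
            return false;
    return true;
}
```

No explicit length check is needed here, because strings of different lengths must differ in at least one count.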

EDIT

Your XOR-based solution is not correct. More than one combination of characters can XOR to the same value, so the XOR of the character codes of two different strings is not guaranteed to differ. When the two strings really are anagrams, the result is always correct. When they are not, the result may be wrong (a false positive).
