
Bit manipulation with left shift and check string

I am working on some code that checks for repeated characters in a string. Here is an answer I found somewhere.

int checker = 0, val = 0, max = 0, j = 0, count = 0;
for(int i=0; i<s.size() && j<s.size(); i++)
{
    j = i;
    while(j<s.size())
    {
        val = s[j]-'a';
        if ((checker & (1<<val)) >0) break;
        checker |= 1 << val;
        j++;
        count++;
    }
    if(count > max) max = count;
    checker = 0;
    count = 0;
}
return max;

The method is clear and straightforward. However, I am confused by these lines.

        val = s[j]-'a';
        if ((checker & (1<<val)) >0) break;
        checker |= 1 << val;

I understand that val is some value after the subtraction, and that (1 << val) is 1 left-shifted by val, which as far as I understand is 1*2^(val). Then 1 << val needs to equal 1 to jump out of the loop. But how is that achieved? Thanks.

Let's break it down line-by-line.

val = s[j]-'a';

This is a nifty trick that converts any character in the range 'a'->'z' to a number 0-25. You more often see it as s[j]-'0', which converts a digit character to a number, but it works just as well for letters. It relies on the fact that in ASCII/UTF-8 the lowercase letters are contiguous, so if you treat a character as a number and subtract the starting letter, you get the 'offset' of the character, with 'a' being 0 and 'z' being 25.
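As a small illustration of the offset trick, here is a minimal sketch. The helper names letter_index and digit_value are hypothetical (they are not in the question's code), and the letter version assumes an ASCII-compatible character set; the C++ standard only guarantees contiguity for the digits '0'..'9'.

```cpp
#include <cassert>

// Hypothetical helper: maps 'a'..'z' to 0..25.
// Assumes an ASCII-compatible character set, where lowercase letters are contiguous.
int letter_index(char c)
{
    assert(c >= 'a' && c <= 'z');  // outside this range the result is not a valid index
    return c - 'a';
}

// The more familiar digit variant: maps '0'..'9' to 0..9.
int digit_value(char c)
{
    assert(c >= '0' && c <= '9');
    return c - '0';
}
```

With these, letter_index('a') is 0 and letter_index('z') is 25, which is exactly the range of shift amounts the rest of the code relies on.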

if ((checker & (1<<val)) >0) break;

The key here is to understand what 1<<val does. It left-shifts a single 1 bit by val positions. So for 'a' you'd get 0b1, for 'b' you'd get 0b10, and so on. Effectively, it one-hot encodes a letter as a bit in a 32-bit integer. If we then & this with our checker value, which records the same one-hot bits for the letters we've already seen, the result will be >0 if and only if checker contains a 1 in the bit representing that letter. If that's the case, we've found a duplicate, so we break.

checker |= 1 << val;

If we've gotten here, it means checker didn't contain a 1 in the bit for that letter. So we've now seen this letter, and need to update checker. |='ing it with the same 1 << val mask from before sets exactly that single bit to 1, while leaving all other bits unchanged.
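The test and the update can be pulled out into two tiny functions to make each step visible. This is only a sketch: bit_is_set and with_bit_set are hypothetical names, and it uses std::uint32_t instead of the question's int so that every bit position from 0 to 31 is safe to shift into.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if bit number `val` is 1 in `mask`.
bool bit_is_set(std::uint32_t mask, int val)
{
    assert(val >= 0 && val < 32);  // shifting by 32 or more would be undefined
    return (mask & (std::uint32_t{1} << val)) != 0;
}

// Hypothetical helper: returns `mask` with bit number `val` forced to 1.
// All other bits are left as they were.
std::uint32_t with_bit_set(std::uint32_t mask, int val)
{
    assert(val >= 0 && val < 32);
    return mask | (std::uint32_t{1} << val);
}
```

Starting from a mask of 0, setting bit 2 (the letter 'c') gives 0b100. Setting it a second time changes nothing, which is why the duplicate check has to come before the update.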

Piece by piece:

Set val to the current character minus 'a'; that means 'a' gives 0 and 'z' gives 25.

val = s[j]-'a';

Check the bit in checker: if bit val is already set in checker, then break. This works by bitwise AND-ing the value against the bitmask; if the bit is set, the result should be positive (assumptions, assumptions).

if ((checker & (1<<val)) >0) break;

Otherwise, set bit val to 1 by OR-ing it in.

checker |= 1 << val;

The code makes lots of assumptions; for example int needs to have at least 26 bits, and characters outside 'a' - 'z' in the string could cause undefined behaviour.
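One way to make those assumptions explicit is sketched below. The function name longest_unique_run is hypothetical, the unsigned 32-bit mask replaces the original int, and throwing on characters outside 'a'..'z' is a choice made for this sketch, not something the original code does. It also drops the original's early exit from the outer loop, which does not change the result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical wrapper around the loop from the question.
// Returns the length of the longest run of distinct letters in s.
int longest_unique_run(const std::string& s)
{
    int max = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        std::uint32_t checker = 0;  // fresh, empty mask for each start position
        int count = 0;
        for (std::size_t j = i; j < s.size(); ++j) {
            // Reject input that would produce an out-of-range shift.
            if (s[j] < 'a' || s[j] > 'z')
                throw std::invalid_argument("expected only 'a'..'z'");
            const std::uint32_t bit = std::uint32_t{1} << (s[j] - 'a');
            if (checker & bit) break;  // letter already seen in this run
            checker |= bit;            // remember the letter
            ++count;
        }
        assert(count <= 26);  // a run cannot contain more than 26 distinct letters
        if (count > max) max = count;
    }
    return max;
}
```

For example, longest_unique_run("abcabcbb") returns 3 (the run "abc"), while a string such as "aB" makes the function throw instead of shifting by a negative amount.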

The code's author is using the variable 'checker' as a bit mask to remember which characters he has already seen. The line:

val = s[j] - 'a';

is normalizing the ASCII value of the character s[j] down by the ASCII value of 'a'. Basically, he is figuring out which letter of the alphabet this character is, as an index in the range [0, 25] for lowercase letters: a is 0, b is 1, c is 2 and so on.

He is then checking whether this bit is already set in 'checker'. He does this by left-shifting 1 by the normalized value and AND'ing it with 'checker'. If that bit is not set in 'checker', then the bitwise AND will return zero and the loop will continue. If it is set, then the AND will return non-zero and his test will break the loop.

When the bit is not set, he sets the bit in 'checker' that corresponds to that position, by bitwise OR'ing 1 left-shifted by 'val' with the existing 'checker'. If the character was 'a', the least significant bit is set; if 'b', the second least significant bit is set; and so on.

PS - He could have just as easily made 'checker' be an array of 26 characters and done:

char checker[26] = { 0 };
...
    while(j < s.size() && !checker[s[j] - 'a'])
    {
        checker[s[j] - 'a'] = 1;
        ++j;
        ++count;
    }
...

I'm sure you would have understood that. That's basically what he is doing, except that he stuffs the array into a bit mask using some bit manipulation. That way he can also easily clear all the set bits simply by setting checker to zero.
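A middle ground between the array and the hand-rolled mask is std::bitset, which stores one bit per flag but is used like an array. The sketch below is a swapped-in alternative, not the original author's approach; the function name distinct_run_from is hypothetical and the input is assumed to contain only 'a'..'z'.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: length of the run of distinct letters starting at `start`.
std::size_t distinct_run_from(const std::string& s, std::size_t start)
{
    std::bitset<26> seen;  // one flag per letter, all initially 0
    std::size_t j = start;
    while (j < s.size()) {
        assert(s[j] >= 'a' && s[j] <= 'z');  // assumption: lowercase letters only
        const std::size_t idx = static_cast<std::size_t>(s[j] - 'a');
        if (seen.test(idx)) break;  // same role as checker & (1 << val)
        seen.set(idx);              // same role as checker |= 1 << val
        ++j;
    }
    return j - start;
}
```

Because seen is created fresh on every call, there is no separate reset step here; with a long-lived bitset, seen.reset() would play the role of checker = 0.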

The funny piece of code you show us makes a few assumptions:

  1. The string s only contains lower case letters ('a'..'z').
  2. The type int has 32 bits (or more).

What the code does is set a bit in the checker variable for each character it has found so far (26 lowercase characters fit into a 31/32-bit int, with 1 bit associated with each character). It would have been better to use uint32_t, by the way.

By subtracting 'a' from the current character he gets values (0..25) if his string holds assumption 1.

The if() expression tests whether that bit has been set before, i.e. whether the character occurred before.

No matter which bit is set in checker, it is != 0. And if assumption 1 holds, it is always > 0 (there is no way to reach bit 31, which is the sign bit).

Each bit of checker, counting from right to left, marks one character that has been found. For example, if b is found in the string, the second bit from the right is set; if it's c, the third bit is set, and so on. This checker bitmask is then used to match subsequent characters.
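To see the resulting bit patterns concretely, here is a minimal sketch that builds the mask for a whole string. The name letter_mask is hypothetical, and the input is assumed to contain only 'a'..'z'.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: OR together one bit per letter that appears in s.
std::uint32_t letter_mask(const std::string& s)
{
    std::uint32_t mask = 0;
    for (char c : s) {
        assert(c >= 'a' && c <= 'z');  // assumption: lowercase letters only
        mask |= std::uint32_t{1} << (c - 'a');
    }
    return mask;
}
```

Here letter_mask("b") is 0b10, letter_mask("c") is 0b100, and letter_mask("abc") is 0b111. A repeated letter does not change the mask, so letter_mask("aa") is still 0b1.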
