How do I find 8-bit substrings in strings with ascii values exceeding 127?

Question

I'm struggling to work through an issue I'm running into trying to work with bitwise substrings in strings. In the example below, this simple little function does what it is supposed to for values 0-127, but fails if I attempt to work with ASCII values greater than 127. I assume this is because the string itself is signed. However, if I make it unsigned, I not only run into issues because apparently strlen() doesn't operate on unsigned strings, but I get a warning that it is a multi-char constant. Why the multiple chars? I think I have tried everything. Is there something I could do to make this work on values > 127?

#include <iostream>
#include <cstring>

const unsigned char DEF_KEY_MINOR = 0xAD;

const char *buffer = { "jhsi≠uhdfiwuui73" };

size_t isOctetInString(const char *buffer, const unsigned char octet)
{
  size_t out = 0;
  for (size_t i = 0; i < strlen(buffer); ++i)
  {
    if(!(buffer[i] ^ octet))
    {
      out = i;
      break;
    }
  }
  return out;
}

int main() {
    std::cout << isOctetInString(buffer, 'i') << "\n";
    std::cout << isOctetInString(buffer, 0x69) << "\n";
    std::cout << isOctetInString(buffer, '≠') << "\n";
    std::cout << isOctetInString(buffer, 0xAD) << "\n";
    return 0;
}

output

Edit

Based on comments I have tried a few different things including casting the octet and buffer to unsigned int, and wchar_t, and removing the unsigned char from the octet parameter type. With either of these the outputs I am getting are

I even tried substituting the ≠ char in the buffer with

const char *buffer = {'0xAD', "jhsiuhdfiwuui73"};

however I still get warnings about multibyte characters.

As I said before, my main concern is to be able to find the bit sequence 0xAD within a string, but I am seeing now that using ascii characters or any construct making use of the ascii character set will cause issues. Since 0xAD is only 8 bits, there must be a way of doing this. Does anyone know a method for doing so?

Answer 1

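Sign extension -- both operands are promoted to int, so buffer[i] ^ octet is really int(buffer[i]) ^ int(octet). Where char is signed, a stored byte of 0xAD is promoted to a negative int while octet is promoted to 173, so the XOR is never zero. If you want buffer[] to be unsigned char, you have to define it that way.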

Answer 2

There are multiple sources of confusion in your problem:

Searching for an unsigned char value in a string can be done with strchr(), which converts both the int argument and the characters in the char array to unsigned char for the comparison.
Your function uses if(!(buffer[i] ^ octet)) to detect a match, which does not work if char is signed: the expression is evaluated as if(!((int)buffer[i] ^ (int)octet)), and sign extension occurs only for buffer[i]. A simple solution is:
```
 if ((unsigned char)buffer[i] == octet)
```
Note that the character ≠ might be encoded as multiple bytes on your target system, both in the source code and in the terminal handling. For example, ≠ is code point 8800 (0x2260), which is encoded as 0xE2 0x89 0xA0 in UTF-8. The syntax '≠' would then pose a problem. I'm not sure how C++ deals with multi-byte character constants, but C would accept them with an implementation-defined value.

To see how your system handles non-ASCII bytes, you could add these lines to your main() function:

    std::cout << "≠ uses " << sizeof("≠") - 1 << " bytes\n";
    std::cout << "'≠' has the value " << (int)'≠' << "\n";

or more explicitly:

    printf("≠ is encoded as");
    for (size_t i = 0; i < sizeof("≠") - 1; i++) {
        printf(" %02hhX", "≠"[i]);
    }
    printf(" and '≠' has a value of 0x%X\n", '≠');

On my Linux system, the latter outputs:

 ≠ is encoded as E2 89 A0 and '≠' has a value of 0xE289A0

On my MacBook, compilation fails with this error:

    notequal.c:8:48: error: character too large for enclosing character literal type
        printf(" and '≠' has a value of 0x%X\n", '≠');

2 answers

solution1
1 2020-08-16 23:59:04

solution2
0 2020-08-17 19:45:32
