
Match implementation-specific C++ char to uint32_t cast in Python

Consider the following C++ program:

#include <cstdint>
#include <iostream>
#include <string>

int main() {
  std::string s = "αa";
  std::cout << std::hex << uint32_t(s[0]) << std::endl;
  std::cout << std::hex << uint32_t(s[1]) << std::endl;
  std::cout << std::hex << uint32_t(s[2]) << std::endl;
}

which prints

ffffffce
ffffffb1
61

How can I replicate this casting behavior in Python? I.e., how can I obtain a numpy array of dtype uint32 containing the three numbers? 1

For example

import numpy as np

s = "αa"
s = s.encode('utf-8')
for c in bytearray(s):
    print(hex(np.uint32(c)))

will result in

0xce
0xb1
0x61

which is not sufficient. I have also looked into the functionality provided by the ctypes module but could not find a working solution.

Motivation: I would like to apply a Fowler–Noll–Vo hash function, which relies on bit-wise operations, matching an existing C++ implementation that operates by casting the elements of a std::string to uint32_t.

1 While the output of the C++ version depends on the architecture/compiler, I am looking for an implementation that either matches the behavior described in this question, or the behavior of the C++ program when compiled with the same compiler that the Python interpreter was compiled with.

According to the Python docs:

The bytearray type is a mutable sequence of integers in the range 0 <= x < 256.

IMHO, the conversion in C++ should hence handle the characters as unsigned char. This can be achieved by a "two-step" cast:

#include <cstdint>
#include <iostream>
#include <string>

typedef unsigned char uchar;

int main() {
  std::string s = "αa";
  std::cout << std::hex << uint32_t((uchar)s[0]) << std::endl;
  std::cout << std::hex << uint32_t((uchar)s[1]) << std::endl;
  std::cout << std::hex << uint32_t((uchar)s[2]) << std::endl;
}

Output:

ce
b1
61

Live Demo on coliru

Notes:

  1. I consider the initialization std::string s = "αa"; a bit critical, as it depends on the source code encoding. (I'm on Windows. Using Windows-1252 encoding, as is usual for a lot of Windows applications, would break this program, as the string would have only two elements. I just realized that Windows-1252 doesn't even encode α, but this doesn't make it better.)

  2. Forcing the characters to unsigned char should make the application independent of the signed-ness of the specific char type of the C++ compiler.

The problem here is that your C++ implementation (as many do, and as is unfortunately allowed - but not mandated - by the standard) has char as a signed type, while Python rightly considers bytearray elements as non-negative values.

The correct solution IMO would be what @Scheff shows in his answer - fix the C++ program, which relies on implementation-defined behavior and generates disputable output. OTOH, if you are forced to match an existing C++ program that cannot be altered, you can easily reproduce this behavior in Python.

In your C++ program, when a byte value beyond 127 (and hence negative as a signed char) gets converted to uint32_t, it gets wrapped modulo 2³², hence all those ffffffxx values.
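
You can check the wrap-around arithmetic directly in plain Python: 0xce is 206, which a signed char reads as 206 - 256 = -50, and -50 modulo 2³² is 0xffffffce:

>>> hex((0xce - 256) % 2**32)
'0xffffffce'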

To obtain the same result in Python you can manually cast to int8 (i.e., char in your C++ implementation) first:

import numpy as np

s = "αa"
s = s.encode('utf-8')
for c in bytearray(s):
    print(hex(np.uint32(np.int8(c))))

which outputs:

0xffffffce
0xffffffb1
0x61

The fact that you got 0xffffffce for the first character is implementation-dependent: a valid C++ implementation could also return 0xce, because the difference depends on whether the default char type is signed or unsigned (some compilers provide a command-line switch to change this behavior, so it's not even just platform-dependent, but compile-options-dependent).

That said, you can map an unsigned character to the same uint32 result as the conversion of a signed one by simply sign-extending the 8th bit, i.e. by converting to the corresponding signed value before doing the cast... for example

print(hex(np.uint32(c if c < 128 else c-256)))
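
Putting this conversion to use, here is a minimal pure-Python sketch of a 32-bit FNV-1a loop over such sign-extended bytes, using the published FNV offset basis and prime. Your existing C++ implementation may differ in its details, so treat this as an illustration of the masking technique rather than a drop-in replacement:

FNV32_OFFSET_BASIS = 0x811C9DC5  # published 32-bit FNV offset basis
FNV32_PRIME = 0x01000193         # published 32-bit FNV prime

def fnv1a32_signed_char(data: bytes) -> int:
    # 32-bit FNV-1a where each byte is first cast like a signed C++ char
    h = FNV32_OFFSET_BASIS
    for c in data:
        v = (c - 256) & 0xFFFFFFFF if c >= 128 else c  # emulate uint32_t(char)
        h = ((h ^ v) * FNV32_PRIME) & 0xFFFFFFFF       # product wraps modulo 2**32
    return h

print(hex(fnv1a32_signed_char("αa".encode('utf-8'))))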

One way to get a numpy array of uint32 is to pass it through an int8 array first:

>>> s = 'αa'
>>> a = np.array(list(s.encode('utf8')),dtype=np.int8)
>>> b = np.array(a,dtype=np.uint32)
>>> b
array([4294967246, 4294967217,         97], dtype=uint32)
>>> for c in b: print(hex(c))
...
0xffffffce
0xffffffb1
0x61
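
An equivalent, somewhat more direct route (still assuming the numpy import above) is to reinterpret the encoded bytes as int8 with np.frombuffer and then convert:

>>> b = np.frombuffer('αa'.encode('utf8'), dtype=np.int8).astype(np.uint32)
>>> [hex(c) for c in b]
['0xffffffce', '0xffffffb1', '0x61']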
