Consider the following C++ program:
#include <cstdint>
#include <iostream>
#include <string>

int main() {
    std::string s = "αa";
    std::cout << std::hex << uint32_t(s[0]) << std::endl;
    std::cout << std::hex << uint32_t(s[1]) << std::endl;
    std::cout << std::hex << uint32_t(s[2]) << std::endl;
}
which prints
ffffffce
ffffffb1
61
How can I replicate this casting behavior in Python? I.e., how can I obtain a NumPy array of dtype uint32 containing these three numbers? [1]
For example
import numpy as np
s = "αa"
s = s.encode('utf-8')
for c in bytearray(s):
    print(hex(np.uint32(c)))
will result in
0xce
0xb1
0x61
which is not sufficient. I have also looked into the functionality provided by the ctypes module but could not find a working solution.
Motivation: I would like to apply a Fowler–Noll–Vo hash function, which relies on bit-wise operations, matching an existing C++ implementation that operates by casting the elements of a std::string to uint32_t.
[1] While the output of the C++ version depends on the architecture and compiler, I am looking for an implementation that matches either the behavior described in this question or the behavior of the C++ program when it is compiled with the same compiler that was used to build the Python interpreter.
According to the Python documentation:
The bytearray type is a mutable sequence of integers in the range 0 <= x < 256.
IMHO, the C++ conversion should therefore treat the characters as unsigned char. This can be achieved with a two-step cast:
#include <cstdint>
#include <iostream>
#include <string>

typedef unsigned char uchar;

int main() {
    std::string s = "αa";
    // Cast to unsigned char first so the value is never sign-extended.
    std::cout << std::hex << uint32_t((uchar)s[0]) << std::endl;
    std::cout << std::hex << uint32_t((uchar)s[1]) << std::endl;
    std::cout << std::hex << uint32_t((uchar)s[2]) << std::endl;
}
Output:
ce
b1
61
Notes:
I consider the initialization std::string s = "αa"; somewhat critical, because the bytes stored in the string depend on the source file encoding. (I'm on Windows. Saving the source as Windows-1252, which is common for Windows applications, would break this program because the string would then have only two elements. I just realized that Windows-1252 doesn't even encode α, but that doesn't make it better.)
Forcing the characters to unsigned char should make the application independent of the signedness of the compiler's char type.
The problem here is that your C++ implementation has char as a signed type (as many do, and as the standard unfortunately allows but does not mandate), while Python rightly considers bytearray elements to be non-negative values.
The correct solution IMO is the one @Scheff shows in his answer: fix the C++ program, which relies on implementation-defined behavior and produces questionable output. OTOH, if you are forced to match an existing C++ program that cannot be altered, you can easily reproduce this behavior in Python.
In your C++ program, when a byte value above 127 (which is negative as a signed char) is converted to uint32_t, it wraps around modulo 2³², hence all those ffffffxx values.
To obtain the same result in Python, you can manually cast to int8 (i.e. char in your C++ implementation) first:
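import numpy as np
s = "αa"
s = s.encode('utf-8')
for c in bytearray(s):
    # Go through uint8 first: recent NumPy versions raise OverflowError
    # for np.int8(c) when c is a Python int above 127.
    print(hex(np.uint8(c).astype(np.int8).astype(np.uint32)))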
which outputs:
0xffffffce
0xffffffb1
0x61
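The wrap-around described above can also be sketched in plain Python, without NumPy. This is only an illustration that assumes char is a signed 8-bit type, as in the question's C++ build; the helper name signed_char_to_uint32 is made up for this example.

```python
def signed_char_to_uint32(byte: int) -> int:
    """Mimic uint32_t(char) on a platform where char is a signed 8-bit type."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a byte value in range(256)")
    # Reinterpret the byte as a signed 8-bit value: 0x80..0xFF become -128..-1.
    signed = byte - 0x100 if byte >= 0x80 else byte
    # Converting to an unsigned 32-bit type reduces the value modulo 2**32.
    return signed % 2**32


for b in "αa".encode("utf-8"):
    print(hex(signed_char_to_uint32(b)))
```

For the UTF-8 bytes of "αa" this prints 0xffffffce, 0xffffffb1 and 0x61, matching the C++ output in the question.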
The fact that you got 0xffffffce for the first character is implementation-dependent; a valid C++ implementation could also return 0xce, because the result depends on whether the default char type is signed or unsigned. Some compilers provide a command-line switch to change this, so it depends not just on the platform but also on the compile options.
That said, you can make an unsigned byte convert to the same uint32 value as a signed one by sign-extending its top bit, i.e. by converting it to the corresponding signed value before the cast. For example:
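# Mask to 32 bits before the cast: recent NumPy versions raise OverflowError
# when a negative Python int is passed to np.uint32.
print(hex(np.uint32((c if c < 128 else c - 256) & 0xFFFFFFFF)))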
One way to get a numpy array of uint32 is to pass it through an int8 array first:
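>>> import numpy as np
>>> s = 'αa'
>>> # frombuffer reinterprets the raw bytes as int8; recent NumPy versions
>>> # reject np.array([206, ...], dtype=np.int8) built from Python ints.
>>> a = np.frombuffer(s.encode('utf8'), dtype=np.int8)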
>>> b = np.array(a,dtype=np.uint32)
>>> b
array([4294967246, 4294967217, 97], dtype=uint32)
>>> for c in b: print(hex(c))
...
0xffffffce
0xffffffb1
0x61
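Since the stated goal is to match a C++ FNV hash, here is a rough sketch of how the sign-extended values might feed into one. It assumes the C++ code uses the 32-bit FNV-1a variant (XOR, then multiply) with the standard offset basis and prime; the question does not say which variant is used, so that choice is an assumption. The names sign_extended_bytes and fnv1a_32_signed_char are hypothetical.

```python
import numpy as np

FNV_OFFSET_BASIS_32 = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME_32 = 0x01000193         # standard 32-bit FNV prime


def sign_extended_bytes(text: str, encoding: str = "utf-8") -> np.ndarray:
    """Return the encoded bytes as uint32, sign-extended like a signed char."""
    raw = text.encode(encoding)
    # Reinterpret each byte as int8, then widen to uint32 (wraps modulo 2**32).
    return np.frombuffer(raw, dtype=np.int8).astype(np.uint32)


def fnv1a_32_signed_char(text: str, encoding: str = "utf-8") -> int:
    """FNV-1a (32-bit) over sign-extended bytes; assumed variant, see lead-in."""
    h = FNV_OFFSET_BASIS_32
    for value in sign_extended_bytes(text, encoding):
        h ^= int(value)                      # XOR with the full 32-bit value
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # multiply, keep the low 32 bits
    return h


print(sign_extended_bytes("αa"))
print(hex(fnv1a_32_signed_char("αa")))
```

For pure ASCII input the result is the same as an FNV-1a over unsigned bytes; the two only diverge for bytes above 127, where the sign extension sets the upper 24 bits before the XOR. If the C++ code uses FNV-1 (multiply, then XOR) or a 64-bit variant, the loop body and constants would need to change accordingly.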