How can I safely use a Java byte as an unsigned char?

Question

I am porting some C code that uses a lot of bit manipulation into Java. The C code operates under the assumption that int is 32 bits wide and char is 8 bits wide. There are assertions in it that check whether those assumptions are valid.

I have already come to terms with the fact that I'll have to use long in place of unsigned int. But can I safely use byte as a replacement for unsigned char?

They merely represent bytes, but I have already run into this bizarre incident (data is an unsigned char * in C and a byte[] in Java):

/* C */
uInt32 c = (data[0] << 24) | (data[1] << 16) | (data[2] << 8) | data[3];

/* Java */
long a = ((data[0] << 24) | (data[1] << 16) | (data[2] << 8) | data[3]) & 0xffffffff;
long b = ((data[0] & 0xff) << 24) | ((data[1] & 0xff) << 16) |
          ((data[2] & 0xff) << 8) | (data[3] & 0xff) & 0xffffffff;

You would think a left shift operation is safe. But due to Java's strange unary promotion rules, a and b are not going to be the same if some of the bytes in data are "negative" (b gives the correct result).

What other "gotchas" should I be aware of? I really don't want to use short here.

Answer 1

You can safely use a byte to represent a value between 0 and 255 if you make sure to bitwise-AND its value with 255 (or 0xFF) before using it in computations. This promotes it to an int, and ensures the promoted value is between 0 and 255.

Otherwise, integer promotion would result in an int value between -128 and 127, using sign extension. -127 as a byte (hex 0x81) would become -127 as an int (hex 0xFFFFFF81).
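To make the difference concrete, here is a minimal sketch of both widening paths for the byte 0x81. The class and method names (SignExtensionDemo, widenSigned, widenUnsigned) are made up for illustration and are not part of any library.

```java
// Illustrative only: the class and method names are hypothetical.
final class SignExtensionDemo {
    /** Plain widening conversion: sign-extends, so the result is in -128..127. */
    static int widenSigned(byte b) {
        return b;
    }

    /** Promote to int, then mask off the sign-extended bits: result is in 0..255. */
    static int widenUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = (byte) 0x81;
        System.out.println(Integer.toHexString(widenSigned(b)));   // ffffff81
        System.out.println(Integer.toHexString(widenUnsigned(b))); // 81
        System.out.println(widenSigned(b));                        // -127
        System.out.println(widenUnsigned(b));                      // 129
    }
}
```

The stored bit pattern is the same in both cases; only the interpretation at the moment of promotion differs.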

So you can do this:

long a = (((data[0] & 255) << 24) | ((data[1] & 255) << 16) | ((data[2] & 255) << 8) | (data[3] & 255)) & 0xffffffffL;

The final mask needs the L suffix: without it, 0xffffffff is an int with the value -1, so the AND changes nothing and the int result is sign-extended when it is assigned to the long. Note that the first & 255 is unnecessary here, since the << 24 shifts the sign-extended bits out of the 32-bit int anyway. But it's probably simplest to just always include it.
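If this pattern shows up often in the ported code, it may be worth wrapping it in a helper. The sketch below is one possible shape, not the only one: UnsignedReads, readUInt32BE, and readUInt32BEWithLibrary are hypothetical names, and the second method assumes Java 8 or later for Integer.toUnsignedLong. Both read four bytes in big-endian order, matching the C expression in the question.

```java
import java.nio.ByteBuffer;

// Illustrative only: the class and method names are hypothetical.
final class UnsignedReads {
    /** Manual version: mask each byte, combine as an int, then widen with a long mask. */
    static long readUInt32BE(byte[] data, int offset) {
        int bits = ((data[offset] & 0xFF) << 24)
                 | ((data[offset + 1] & 0xFF) << 16)
                 | ((data[offset + 2] & 0xFF) << 8)
                 |  (data[offset + 3] & 0xFF);
        // The L suffix makes the mask a long, so the int is not sign-extended.
        return bits & 0xFFFFFFFFL;
    }

    /** Library version (assumes Java 8+): ByteBuffer reads big-endian by default. */
    static long readUInt32BEWithLibrary(byte[] data, int offset) {
        return Integer.toUnsignedLong(ByteBuffer.wrap(data).getInt(offset));
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0x81, 0x02, 0x03, (byte) 0xFF};
        System.out.println(Long.toHexString(readUInt32BE(data, 0)));            // 810203ff
        System.out.println(Long.toHexString(readUInt32BEWithLibrary(data, 0))); // 810203ff
    }
}
```

Keeping the masking in one place means the rest of the port can work with long values in the range 0 to 4294967295 without repeating the & 0xFF at every use.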

Answer 2

... can I safely use byte as a replacement for unsigned char?

As you've discovered, not really... No.

According to the Oracle Java documentation, byte is a signed integer type. Although it has 256 distinct values (the documentation states the range explicitly: "It has a minimum value of -128 and a maximum value of 127 (inclusive)"), there are values that an unsigned char in C can store that a byte in Java can't, and vice versa.

That explains the problem you've experienced. However, the extent of the problem hasn't been fully demonstrated on your 8-bit-byte implementation.

What other "gotchas" should I be aware of?

Whilst a byte in Java is required to support only values between -128 and 127 (inclusive), C's unsigned char has a maximum value (UCHAR_MAX) that depends on the number of bits used to represent it (CHAR_BIT; at least 8). So when CHAR_BIT is greater than 8, unsigned char can store extra values beyond 255.

In summary, in the world of Java a byte should really be called an octet (a group of eight bits), whereas in C a byte (char, signed char, unsigned char) is a group of at least (possibly more than) eight bits.

No. They are not equivalent. I don't think you'll find an equivalent type in Java, either; they're all rather fixed-width. You could safely use byte in Java as an equivalent for int8_t in C, however (except that int8_t isn't required to exist in C unless CHAR_BIT == 8).

As for pitfalls, there are some in your C code too. Assuming data[0] is an unsigned char, data[0] << 24 is undefined behaviour on any system for which INT_MAX == 32767.

Answer 1
3 Accepted 2015-07-04 06:03:24

Answer 2
-1 2015-07-04 06:16:39
