Safely punning char* to double in C

In an open source program I wrote, I'm reading binary data (written by another program) from a file and outputting ints, doubles, and other assorted data types. One of the challenges is that it needs to run on 32-bit and 64-bit machines of both endiannesses, which means that I end up having to do quite a bit of low-level bit-twiddling. I know a (very) little bit about type punning and strict aliasing and want to make sure I'm doing things the right way.

Basically, it's easy to convert from a char* to an int of various sizes:

int64_t snativeint64_t(const char *buf) 
{
    /* Interpret the first 8 bytes of buf as a 64-bit int */
    return *(int64_t *) buf;
}

and I have a set of support functions to swap byte orders as needed, such as:

int64_t swappedint64_t(const int64_t wrongend)
{
    /* Change the endianness of a 64-bit integer */
    return (((wrongend & 0xff00000000000000LL) >> 56) |
            ((wrongend & 0x00ff000000000000LL) >> 40) |
            ((wrongend & 0x0000ff0000000000LL) >> 24) |
            ((wrongend & 0x000000ff00000000LL) >> 8)  |
            ((wrongend & 0x00000000ff000000LL) << 8)  |
            ((wrongend & 0x0000000000ff0000LL) << 24) |
            ((wrongend & 0x000000000000ff00LL) << 40) |
            ((wrongend & 0x00000000000000ffLL) << 56));
}

At runtime, the program detects the endianness of the machine and assigns one of the above to a function pointer:

int64_t (*slittleint64_t)(const char *);
if(littleendian) {
    slittleint64_t = snativeint64_t;
} else {
    slittleint64_t = sswappedint64_t;
}
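The question does not show how `littleendian` is computed. As a minimal sketch, assuming a helper named `host_is_little_endian` (a hypothetical name, not from the original program), the check could inspect the lowest-addressed byte of a known integer:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the original program):
   returns 1 on a little-endian host, 0 otherwise. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first_byte;

    /* Copy out the lowest-addressed byte of the probe value. */
    memcpy(&first_byte, &probe, 1);

    /* A little-endian host stores the least significant byte first. */
    return first_byte == 1;
}
```

The result could then be stored in `littleendian` once at startup, before the function pointers are assigned.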

Now, the tricky part comes when I'm trying to cast a char* to a double. I'd like to re-use the endian-swapping code like so:

union 
{
    double  d;
    int64_t i;
} int64todouble;

int64todouble.i = slittleint64_t(bufoffset);
printf("%lf", int64todouble.d);

However, some compilers could optimize away the "int64todouble.i" assignment and break the program. Is there a safer way to do this, while considering that this program must stay optimized for performance, and also that I'd prefer not to write a parallel set of transformations to cast char* to double directly? If the union method of punning is safe, should I be rewriting my functions like snativeint64_t to use it?


I ended up using Steve Jessop's answer because the conversion functions rewritten to use memcpy, like so:

int64_t snativeint64_t(const char *buf) 
{
    /* Interpret the first 8 bytes of buf as a 64-bit int */
    int64_t output;
    memcpy(&output, buf, 8);
    return output;
}

compiled into the exact same assembler as my original code:

snativeint64_t:
        movq    (%rdi), %rax
        ret

Of the two, the memcpy version more explicitly expresses what I'm trying to do and should work on even the most naive compilers.

Adam, your answer was also wonderful and I learned a lot from it. Thanks for posting!

I highly suggest you read Understanding Strict Aliasing. Specifically, see the sections labeled "Casting through a union". It has a number of very good examples. While the article is on a website about the Cell processor and uses PPC assembly examples, almost all of it is equally applicable to other architectures, including x86.

The standard says that writing to one field of a union and immediately reading from a different field is undefined behaviour. So if you go by the rule book, the union-based method won't work.

Macros are usually a bad idea, but this might be an exception to the rule. It should be possible to get template-like behaviour in C using a set of macros that take the input and output types as parameters.
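As a rough sketch of what such a macro could look like: the name `DEFINE_BIT_CAST` and the generated function names are hypothetical, the copy is done with memcpy rather than a union, and `_Static_assert` assumes a C11 compiler.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical macro: defines a function NAME that copies the bytes
   of a FROM value into a TO value of the same size. */
#define DEFINE_BIT_CAST(NAME, TO, FROM)                              \
    TO NAME(FROM value)                                              \
    {                                                                \
        TO result;                                                   \
        _Static_assert(sizeof(TO) == sizeof(FROM), "size mismatch"); \
        memcpy(&result, &value, sizeof result);                      \
        return result;                                               \
    }

/* One line per conversion, similar to instantiating a template. */
DEFINE_BIT_CAST(int64_to_double, double, int64_t)
DEFINE_BIT_CAST(double_to_int64, int64_t, double)
```

A call such as `int64_to_double(slittleint64_t(bufoffset))` would then reuse the existing integer readers without a second set of byte-swapping functions.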

Since you seem to know enough about your implementation to be sure that int64_t and double are the same size, and have suitable storage representations, you might hazard a memcpy. Then you don't even have to think about aliasing.

Since you're using a function pointer for a function that might easily be inlined if you were willing to release multiple binaries, performance must not be a huge issue anyway. Still, you might like to know that some compilers can be quite fiendish at optimising memcpy: for small integer sizes a set of loads and stores can be inlined, and you might even find the variables are optimised away entirely and the compiler does the "copy" simply by reassigning the stack slots it's using for the variables, just like a union.

int64_t i = slittleint64_t(bufoffset);
double d;
memcpy(&d,&i,8); /* might emit no code if you're lucky */
printf("%lf", d);

Examine the resulting code, or just profile it. Chances are that even in the worst case it will not be slow.
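For comparison, here is a sketch that avoids the runtime endianness switch altogether by assembling the integer from individual bytes and then copying it into a double. The helper names `load_le_u64` and `load_le_double` are hypothetical, and the code assumes a 64-bit double stored with the same byte order as 64-bit integers.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: assemble 8 little-endian bytes into a uint64_t.
   Shifting byte by byte gives the same value on any host byte order. */
uint64_t load_le_u64(const unsigned char *buf)
{
    uint64_t value = 0;
    for (int i = 7; i >= 0; i--) {
        value = (value << 8) | buf[i];
    }
    return value;
}

/* Hypothetical helper: reinterpret those 64 bits as a double via memcpy.
   Assumes sizeof(double) == 8 and matching integer/double byte order. */
double load_le_double(const unsigned char *buf)
{
    uint64_t bits = load_le_u64(buf);
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

Whether this is faster or slower than the function-pointer approach depends on the compiler, so it is worth inspecting the generated code here too.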

In general, though, doing anything too clever with byteswapping results in portability issues. There exist ABIs with middle-endian doubles, where each word is little-endian, but the big word comes first.

Normally you could consider storing your doubles using sprintf and sscanf, but for your project the file formats aren't under your control. But if your application is just shovelling IEEE doubles from an input file in one format to an output file in another format (not sure if it is, since I don't know the database formats in question, but if so), then perhaps you can forget about the fact that it's a double, since you aren't using it for arithmetic anyway. Just treat it as an opaque char[8], requiring byteswapping only if the file formats differ.
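If the value really is just passed through, the byte swap could operate on the raw bytes without ever naming a double. A minimal sketch, with `reverse_bytes` as a hypothetical helper name:

```c
#include <stddef.h>

/* Hypothetical helper: reverse `len` bytes in place.
   The bytes are never interpreted, so no aliasing question arises. */
void reverse_bytes(unsigned char *bytes, size_t len)
{
    for (size_t i = 0; i < len / 2; i++) {
        unsigned char tmp = bytes[i];
        bytes[i] = bytes[len - 1 - i];
        bytes[len - 1 - i] = tmp;
    }
}
```

It would be called as `reverse_bytes(field, 8)` only when the input and output formats disagree on byte order.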

As a very small sub-suggestion, I suggest you investigate whether you can swap the masking and the shifting in the 64-bit case. Since the operation is swapping bytes, you should always be able to get away with a mask of just 0xff. This should lead to faster, more compact code, unless the compiler is smart enough to figure that one out itself.

In brief, changing this:

((wrongend & 0xff00000000000000LL) >> 56)

into this:

((wrongend >> 56) & 0xff)

should generate the same result.
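Applied to all eight bytes, the shift-then-mask form could look like the sketch below. The function name `swap_u64` is hypothetical, and the sketch uses uint64_t instead of the question's int64_t so that every shift is well defined.

```c
#include <stdint.h>

/* Hypothetical unsigned variant of the question's byte swap.
   Each byte is shifted down to the low position, masked with 0xff,
   then shifted up to its mirrored position. */
uint64_t swap_u64(uint64_t v)
{
    return  ((v >> 56) & 0xff)
         | (((v >> 48) & 0xff) <<  8)
         | (((v >> 40) & 0xff) << 16)
         | (((v >> 32) & 0xff) << 24)
         | (((v >> 24) & 0xff) << 32)
         | (((v >> 16) & 0xff) << 40)
         | (((v >>  8) & 0xff) << 48)
         |  ((v        & 0xff) << 56);
}
```

Swapping twice should return the original value, which makes for an easy sanity check.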

Edit:
Removed my comments on always storing data big-endian and swapping to machine endianness, since the questioner had not mentioned that another program writes his data (which is important information).

Still, if the data needs conversion from host endianness to big-endian and from big-endian back to host endianness, ntohs/ntohl/htons/htonl are the best methods: the most elegant and unbeatable in speed (they perform the task in hardware if the CPU supports that, and you can't beat that).
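As an illustration, reading a big-endian 32-bit field with ntohl might look like the sketch below. The helper name `load_be_u32` is hypothetical, and `<arpa/inet.h>` assumes a POSIX system. Note that these functions only cover 16-bit and 32-bit values and only convert between host order and big-endian (network) order.

```c
#include <arpa/inet.h>  /* POSIX header declaring ntohl and friends */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a big-endian 32-bit value from a buffer. */
uint32_t load_be_u32(const unsigned char *buf)
{
    uint32_t raw;

    /* Copy first so that alignment and aliasing are not a concern. */
    memcpy(&raw, buf, sizeof raw);

    /* Convert from big-endian (network) order to host order. */
    return ntohl(raw);
}
```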


Regarding double/float, just store them to ints by memory casting:

double d = 3.1234;
printf("Double %f\n", d);
int64_t i = *(int64_t *)&d;
// Now i contains the double value as int
double d2 = *(double *)&i;
printf("Double2 %f\n", d2);

Wrap it into a function:

int64_t doubleToInt64(double d)
{
    return *(int64_t *)&d;
}

double int64ToDouble(int64_t i)
{
    return *(double *)&i;
}

The questioner provided this link:

http://cocoawithlove.com/2008/04/using-pointers-to-recast-in-c-is-bad.html

as proof that casting is bad... unfortunately I can only strongly disagree with most of this page. Quotes and comments:

As common as casting through a pointer is, it is actually bad practice and potentially risky code. Casting through a pointer has the potential to create bugs because of type punning.

It is not risky at all and it is also not bad practice. It only has the potential to cause bugs if you do it incorrectly, just as programming in C, or in any other language, has the potential to cause bugs if you do it incorrectly. By that argument you must stop programming altogether.

Type punning
A form of pointer aliasing where two pointers refer to the same location in memory but represent that location as different types. The compiler will treat both "puns" as unrelated pointers. Type punning has the potential to cause dependency problems for any data accessed through both pointers.

This is true, but unfortunately totally unrelated to my code.

What he refers to is code like this:

int64_t * intPointer;
:
// Init intPointer somehow
:
double * doublePointer = (double *)intPointer;

Now doublePointer and intPointer both point to the same memory location, but treat it as different types. This is indeed the situation you should solve with a union; anything else is pretty bad. But that is not what my code does!

My code copies by value, not by reference. I cast a double pointer to an int64 pointer (or the other way round) and immediately dereference it. Once the functions return, there is no pointer held to anything. There is an int64 and a double, and these are totally unrelated to the input parameter of the functions. I never copy any pointer to a pointer of a different type (if you saw this in my code sample, you strongly misread the C code I wrote); I just transfer the value to a variable of a different type (in its own memory location). So the definition of type punning does not apply at all, as it says "refer to the same location in memory" and nothing here refers to the same memory location.

int64_t intValue = 12345;
double doubleValue = int64ToDouble(intValue);
// The statement below will not change the value of doubleValue!
// The two do not point to the same memory location; each has its
// own storage space on the stack and they are totally unrelated.
intValue = 5678;

My code is nothing more than a memory copy, just written in C without an external function.

int64_t doubleToInt64(double d)
{
    return *(int64_t *)&d;
}

could be written as

int64_t doubleToInt64(double d)
{
    int64_t result;
    memcpy(&result, &d, sizeof(d));
    return result;
}

It's nothing more than that, so there is no type punning in sight anywhere. And this operation is also totally safe, as safe as an operation can be in C. On IEEE 754 platforms a double is always 64 bits (unlike int, whose size varies), hence it will always fit into an int64_t-sized variable.
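That size assumption can be checked at compile time instead of taken on trust. A minimal sketch, assuming a C11 compiler for `_Static_assert`:

```c
#include <limits.h>
#include <stdint.h>

/* Refuse to compile on a platform where the conversions above
   would copy the wrong number of bytes. */
_Static_assert(CHAR_BIT == 8, "bytes are not 8 bits wide");
_Static_assert(sizeof(double) == sizeof(int64_t),
               "double and int64_t differ in size");
```

If either condition fails, the build stops with the given message rather than producing a silently wrong conversion.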
