
Workaround for glibc's printf truncation bug in multi-byte locales?

Certain GNU-based OS distros (Debian) are still affected by a bug in GNU libc that causes the printf family of functions to return a bogus -1 when the specified precision would truncate a multi-byte character. This bug was fixed in 2.17 and backported to 2.16. Debian has an archived bug for this, but the maintainers appear to have no intention of backporting the fix to the 2.13 used by Wheezy.

The text below is quoted from https://sourceware.org/bugzilla/show_bug.cgi?id=6530. (Please do not edit the block quoting inline again.)

Here's a simpler testcase for this bug courtesy of Jonathan Nieder:

#include <stdio.h>
#include <locale.h>

int main(void)
{
    int n;

    setlocale(LC_CTYPE, "");
    n = printf("%.11s\n", "Author: \277");
    perror("printf");
    fprintf(stderr, "return value: %d\n", n);
    return 0;
}

Under a C locale that'll do the right thing:

$ LANG=C ./test
Author: �
printf: Success
return value: 10

But not under a UTF-8 locale, since \277 isn't a valid UTF-8 sequence:

$ LANG=en_US.utf8 ./test
printf: Invalid or incomplete multibyte or wide character

It's worth noting that printf will also overwrite the first character of the output array with \0 in this context.

I am currently trying to retrofit a MUD codebase to support UTF-8, and unfortunately the code is riddled with cases where arbitrary sprintf precision is used to limit how much text is sent to output buffers. This problem is made much worse by the fact that most programmers don't expect a -1 return in this context, which can result in uninitialized memory reads and badness that cascades down from that. (I have already caught a few cases in valgrind.)

Has anyone come up with a concise workaround for this bug in their code that doesn't involve rewriting every single invocation of a formatting string with arbitrary length precision? I'm fine with truncated UTF-8 characters being written to my output buffer, as it's fairly trivial to clean that up in my output processing prior to socket write, and it seems like overkill to invest this much effort in a problem that will eventually go away given a few more years.

I'm guessing, and the comments on the question seem to confirm it, that you don't use much of the C library's locale-specific functionality. In that case you'd probably be better off not changing the locale to a UTF-8 based one, and leaving it in the single-byte locale your code assumes.
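As a rough illustration of that suggestion, the sketch below keeps LC_CTYPE in the "C" locale and then formats with a precision. The helper names pin_ctype_to_c and format_prefix are hypothetical and not taken from your codebase. In the "C" locale the precision of %s counts bytes, so the stray \277 byte from the testcase above is copied through instead of triggering the -1 return.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical startup helper: keep LC_CTYPE in the single-byte "C"
 * locale. A program that never calls setlocale() already starts in
 * this state, so this only matters if other code changed it. */
int pin_ctype_to_c(void)
{
    return setlocale(LC_CTYPE, "C") != NULL ? 0 : -1;
}

/* Hypothetical helper: copy at most max_bytes bytes of src into out.
 * In the "C" locale the precision is applied per byte, with no
 * multi-byte validation of the input. */
int format_prefix(char *out, size_t outsz, const char *src, int max_bytes)
{
    return snprintf(out, outsz, "%.*s", max_bytes, src);
}
```

Existing call sites that use a precision with sprintf should then behave as they did before the UTF-8 retrofit, at the cost of possibly cutting a multi-byte character in half.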

When you do need to process UTF-8 strings as UTF-8 strings you can use specialized code. It's not too hard to write your own UTF-8 processing routines. You can even download the Unicode Character Database and do some fairly sophisticated character classification. If you'd prefer to use a third-party library to handle UTF-8 strings, there's ICU, as you mentioned in your comments. It's a pretty heavyweight library though; a previous question recommends a few lighter-weight alternatives.
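One such routine might be the cleanup step you describe: dropping a multi-byte character that was cut off at the end of a buffer before it is written to the socket. The following is a minimal sketch, with utf8_trim_incomplete as a made-up name. It only checks whether the final sequence is complete and does not validate the rest of the buffer.

```c
#include <stddef.h>

/* Hypothetical helper: return the length of buf with any incomplete
 * trailing UTF-8 sequence removed. Bytes that are malformed for other
 * reasons (stray continuation bytes, invalid lead bytes) are left
 * alone, since they are not the result of truncation. */
size_t utf8_trim_incomplete(const char *buf, size_t len)
{
    const unsigned char *s = (const unsigned char *)buf;
    size_t i = len;
    size_t cont = 0;   /* trailing continuation bytes seen so far */
    size_t need;       /* sequence length announced by the lead byte */
    unsigned char lead;

    /* Step back over up to three continuation bytes (10xxxxxx). */
    while (i > 0 && cont < 3 && (s[i - 1] & 0xC0) == 0x80) {
        i--;
        cont++;
    }
    if (i == 0)
        return len;    /* empty, or no lead byte to inspect */

    lead = s[i - 1];
    if (lead < 0x80)
        need = 1;
    else if ((lead & 0xE0) == 0xC0)
        need = 2;
    else if ((lead & 0xF0) == 0xE0)
        need = 3;
    else if ((lead & 0xF8) == 0xF0)
        need = 4;
    else
        return len;    /* not a valid lead byte; not a truncation */

    /* Cut before the lead byte if the sequence is short. */
    return (cont + 1 < need) ? i - 1 : len;
}
```

Calling it on the four bytes "caf\xC3" (a truncated "café") returns 3, so only "caf" would be sent; a complete sequence is left untouched.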

It might also be possible to switch the C locale back and forth as necessary so you can use the C library's functionality. You'll want to check the performance impact of this, however, as switching locales can be an expensive operation.
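A sketch of that back-and-forth switching, assuming the program normally runs in the "C" locale and only needs a UTF-8 locale briefly. The function name count_chars_in_locale is hypothetical, and the locale name to pass (for example en_US.utf8 from the testcase above) depends on what is installed on the system. It relies on the POSIX behaviour of mbstowcs, which returns the required length when the destination is NULL.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count the characters in s as interpreted by
 * locale_name, switching LC_CTYPE only for the duration of the call.
 * Returns (size_t)-1 if the locale cannot be set or s is not a valid
 * multi-byte string in that locale. setlocale() is process-wide, so
 * this is not safe to call while other threads are formatting. */
size_t count_chars_in_locale(const char *s, const char *locale_name)
{
    size_t result = (size_t)-1;
    const char *current = setlocale(LC_CTYPE, NULL);
    char *saved;

    if (current == NULL)
        return result;

    /* Copy the name: later setlocale() calls may overwrite it. */
    saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return result;
    strcpy(saved, current);

    if (setlocale(LC_CTYPE, locale_name) != NULL) {
        result = mbstowcs(NULL, s, 0);  /* length only, no output */
        setlocale(LC_CTYPE, saved);     /* restore previous locale */
    }

    free(saved);
    return result;
}
```

Each call performs two locale switches, which is the cost worth measuring before using this pattern on a hot path.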
