
How can I work around DOS functions that use strings containing accented characters (ASCII to UTF-8)?

I was writing software in which I wanted to reuse old C code written in the early 1980s. This code did some conversion on strings. It also used accented characters that, at that time (DOS), were encoded in the extended ASCII table (codes greater than 127).

Modern systems use UTF-8 encoding, so the old code works very badly. I am using Linux (Ubuntu 17, gcc (Ubuntu 7.2.0-8ubuntu3) 7.2.0).

I'm looking for a workaround that lets me make the fewest possible changes. I have begun running some tests to analyze the issues that arise. I wrote two main programs: one uses char * strings and char elements, the other uses wchar_t * strings and wchar_t elements. Neither works correctly.

The first (using char * and char) requires, for example, a workaround when strchr matches a multi-byte code; it also doesn't print (printf) a multi-byte character correctly, although it prints the char * string correctly. Furthermore, it generates a lot of warnings related to the use of multi-byte characters.

The second (using wchar_t * and char *) runs, but doesn't print the multi-byte characters correctly: they appear as '?' both when they are printed as wchar_t and as wchar_t * (strings).

MAIN1:

#include <stdio.h>
#include <string.h>
#include <inttypes.h>

/* http://clc-wiki.net/wiki/strchr
 * standard C implementation
 */
char *_strchr(const char *s, int c);

char *_strchr(const char *s, int c)
{
    while (*s != (char)c)
        if (!*s++)
            return 0;
    return (char *)s;
}


int main()
{
    char          * p1 = NULL;
    const char    * t1 = "Sergio è un Italiano e andò via!";

    printf("Text --> %s\n\n",t1);

    for(size_t i=0;i<strlen(t1);i++) {
        printf("%02X %c|",(uint8_t)t1[i],t1[i]);
    }
    puts("\n");

    puts("Searching ò");
    /*warning: multi-character character constant [-Wmultichar]
                      p1 = strchr(t1,'ò');
                                     ^~~~
    */
    p1 = strchr(t1,'ò');
    printf("%s\n",p1-1); // -1 needs to correct the position

    /*warning: multi-character character constant [-Wmultichar]
                      p1 = _strchr(t1,'ò');
                                     ^~~~
    */
    p1 = _strchr(t1,'ò');
    printf("%s\n",p1-1);    // -1 needs to correct the position
    puts("");

    puts("Searching è");
    /*warning: multi-character character constant [-Wmultichar]
                      p1 = strchr(t1,'è');
                                     ^~~~
    */
    p1 = strchr(t1,'è');
    printf("%s\n",p1-1);    // -1 needs to correct the position

    /*warning: multi-character character constant [-Wmultichar]
                      p1 = _strchr(t1,'è');
                                     ^~~~
    */
    p1 = _strchr(t1,'è');
    printf("%s\n",p1-1);    // -1 needs to correct the position
    puts("");

    /*warning: multi-character character constant [-Wmultichar]
         printf("%c %c %08X %08X\n",'è','ò','è','ò');
                                    ^~~~
         printf("%c %c %08X %08X\n",'è','ò','è','ò');
                                        ^~~~
         printf("%c %c %08X %08X\n",'è','ò','è','ò');
                                            ^~~~
         printf("%c %c %08X %08X\n",'è','ò','è','ò');
                                                ^~~~
    */
    printf("%c %c %08X %08X\n",'è','ò','è','ò');

    /*multi-character character constant [-Wmultichar]
     printf("%c %c %08X %08X\n",'è','ò',(uint8_t)'è',(uint8_t)'ò');
                                ^~~~
     printf("%c %c %08X %08X\n",'è','ò',(uint8_t)'è',(uint8_t)'ò');
                                    ^~~~
     printf("%c %c %08X %08X\n",'è','ò',(uint8_t)'è',(uint8_t)'ò');
                                                 ^~~~
     printf("%c %c %08X %08X\n",'è','ò',(uint8_t)'è',(uint8_t)'ò');
                                                              ^~~~
    */
    printf("%c %c %08X %08X\n",'è','ò',(uint8_t)'è',(uint8_t)'ò');

    puts("");
    return 0;
}

Output:

[image: MAIN1 output]

MAIN2:

#include <stdio.h>
#include <string.h>
#include <wchar.h>
#include <inttypes.h>

#define wputs(s) wprintf(s"\n")

/* https://opensource.apple.com/source/Libc/Libc-498.1.1/string/wcschr-fbsd.c
 * FBSD C implementation
 */
wchar_t * _wcschr(const wchar_t *s, wchar_t c);

wchar_t * _wcschr(const wchar_t *s, wchar_t c)
{
    while (*s != c && *s != L'\0')
        s++;
    if (*s == c)
        return ((wchar_t *)s);
    return (NULL);
}

int main()
{
    wchar_t       * p1 = NULL;
    const wchar_t * t1 = L"Sergio è un Italiano e andò via!";
    const wchar_t * f0 = L"%02X %c|";
    const wchar_t * f1 = L"Text --> %ls\n\n";
    const wchar_t * f2 = L"%ls\n";

    uint8_t * p = (uint8_t *)t1;

    wprintf(f1,t1);

    for(size_t i=0;;i++) {
        uint8_t c=*(p+i);

        wprintf(f0,c,(c<' ')?'.':(c>127)?'*':c);
        if ( c=='!' )
            break;
    }
    wputs(L"\n");

    wputs(L"Searching ò");

    p1 = wcschr(t1,L'ò');
    wprintf(f2,p1);

    p1 = _wcschr(t1,L'ò');
    wprintf(f2,p1);
    wputs(L"---");

    wputs(L"Searching è");

    p1 = wcschr(t1,L'è');
    wprintf(f2,p1);

    p1 = _wcschr(t1,L'è');
    wprintf(f2,p1);
    wputs(L"");

    wprintf(L"%lc %lc %08X %08X\n",L'è',L'ò',L'è',L'ò');
    wprintf(L"%lc %lc %08X %08X\n",L'è',L'ò',(uint8_t)L'è',(uint8_t)L'ò');

    wputs(L"");

    return 0;
}

Output:

[image: MAIN2 output]

You need to localize your program if you want to use wide-character I/O. It's not difficult: just a setlocale() call, plus optionally fwide() to check whether the user's locale supports wide I/O on the desired stream(s).

In your main(), before any input/output, run

    if (!setlocale(LC_ALL, "")) {
        /* Current locale is not supported
           by the C library; abort. */
    }

As the comment says, this tells your C library that this program is locale-aware, and that it should do the setup and preparations needed to follow the rules of the locale the user has set up. See man 7 locale for further information. Essentially, the C library does not automatically pick up the locale the user has set up, but uses the default C/POSIX locale. This call tells the C library to try to conform to the currently configured locale.
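To see which encoding the C library ends up using after the setlocale() call, a small check like the following can help. This is only a sketch: the helper name locale_is_utf8() is hypothetical (it is not part of the answer), and nl_langinfo(CODESET) is a POSIX call, so a POSIX system such as Linux is assumed.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the currently active locale uses
   UTF-8 as its character encoding, 0 otherwise.
   Call setlocale(LC_ALL, "") first; without that call the program
   stays in the default C/POSIX locale. */
int locale_is_utf8(void)
{
    const char *codeset = nl_langinfo(CODESET);

    return codeset != NULL &&
           (strcmp(codeset, "UTF-8") == 0 || strcmp(codeset, "utf8") == 0);
}
```

Calling locale_is_utf8() right after a successful setlocale(LC_ALL, "") lets the program warn the user when the terminal is not configured for UTF-8.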

In POSIX C, each FILE handle has an orientation that can be queried and set (but only before reading from or writing to it) using fwide(). Note that it is a property of the file handle, not of the file itself; it only determines whether the C library uses byte-oriented (normal/narrow) or wide-character functions to read from and write to the stream. If you don't call it, the C library tries to set the orientation automatically based on the first read/write function you use to access the stream, if the locale has been set. However, using for example

    if (fwide(stdout, 1) <= 0) {
        /* The C library does not support wide-character
           orientation for standard output in this locale.
           Abort.
        */
    }

after the locale setup means you can detect, for that particular stream, whether the C library does not support the user locale or the user locale does not support wide characters at all, and abort the program. (It is always better to tell the user that the results would be garbage than to silently try to do your best and possibly garble the user's data. The user can, after all, always use a different tool; but silently garbling the user's data means this particular tool would simply be untrustworthy: worthless.)
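Passing 0 as the second argument to fwide() only queries the orientation without changing it, which is handy while debugging. A minimal sketch follows; the helper name stream_orientation() is hypothetical and not part of the answer.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: reports the current orientation of a stream
   without changing it. fwide(stream, 0) only queries. */
const char *stream_orientation(FILE *stream)
{
    int mode = fwide(stream, 0);

    if (mode > 0)
        return "wide";   /* wide-character oriented */
    if (mode < 0)
        return "byte";   /* byte oriented */
    return "unset";      /* no orientation chosen yet */
}
```

A freshly opened stream reports "unset"; once an orientation has been set, either explicitly or by the first read or write, later fwide() calls do not change it.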

You must not mix wprintf() and printf(), nor fwprintf() and fprintf(), on the same stream. Doing so either fails (prints nothing), confuses the C library, or produces garbled results. Similarly, you must not mix fgetc() and fgetwc() on the same stream. Simply put, you must not mix byte-oriented and wide-character-oriented functions on the same stream.

This does not mean that you cannot print a byte-oriented (or multibyte) string to a wide-character-oriented stream, or vice versa; quite the opposite. It works very logically: %s and %c always refer to a byte-oriented string or character, and %ls and %lc to a wide string or character. For example, if you have

const wchar_t *ws = L"Hello";
const char     *s = "world!";

you can print them both to byte-oriented standard output using

printf("%ls, %s\n", ws, s);

or to wide-character-oriented standard output using

wprintf(L"%ls, %s\n", ws, s);

This is basically a limitation in the POSIX C library: you must use byte-oriented functions for byte-oriented streams, and wide-character-oriented functions for wide-character-oriented streams. It might feel weird at first, but if you think about it, it's a very clear and simple rule.
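The same %s and %ls rules apply to the string-formatting siblings snprintf() and swprintf(), which makes them easy to try without touching any stream. The sketch below is only an illustration; the helper names join_to_bytes() and join_to_wide() are hypothetical.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: formats "<wide>, <narrow>" into a byte buffer.
   %ls converts the wide string to multibyte using the current locale. */
int join_to_bytes(char *dst, size_t size, const wchar_t *ws, const char *s)
{
    return snprintf(dst, size, "%ls, %s", ws, s);
}

/* Hypothetical helper: same idea, but the result is a wide string.
   Here %s converts the multibyte string to wide characters.
   size is the capacity of dst in wide characters. */
int join_to_wide(wchar_t *dst, size_t size, const wchar_t *ws, const char *s)
{
    return swprintf(dst, size, L"%ls, %s", ws, s);
}
```

With ws = L"Hello" and s = "world!", both helpers produce the text "Hello, world!", once as a byte string and once as a wide string. Strings with accented characters additionally need the locale to be set, because the conversion between the two forms depends on it.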


Let's look at an example program roughly similar to yours, expanded to read (unlimited-length) strings line by line from standard input, using any newline convention (CR, LF, CRLF, LFCR):

#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>

/* Function to read a wide-character line,
   using any newline convention, skipping embedded NUL bytes (L'\0'),
   and dynamically reallocating the buffer as needed.
   If *lineptr==NULL and *sizeptr==0, the buffer is dynamically allocated.
   Returns the number of wide characters read.
   If an error occurs, returns zero, with errno set.
   At end of input, returns zero, with errno zero.
*/
size_t wide_line(wchar_t **lineptr, size_t *sizeptr, FILE *in)
{
    wchar_t *line;
    size_t   size, used = 0;
    wint_t   wc;

    if (!lineptr || !sizeptr) {
        errno = EINVAL;
        return 0;
    }
    if (ferror(in)) {
        errno = EIO;
        return 0;
    }

    if (*sizeptr) {
        line = *lineptr;
        size = *sizeptr;
    } else {
        *lineptr = line = NULL;
        *sizeptr = size = 0;
    }

    while (1) {

        if (used + 3 >= size) {
            /* Conservative dynamic memory reallocation policy. */
            if (used < 126)
                size = 128;
            else
            if (used < 2097152)
                size = (used * 3) / 2;
            else
                size = (used | 1048575) + 1048579;

            /* Check for size overflow. */
            if (used + 2 >= size) {
                errno = ENOMEM;
                return 0;
            }

            line = realloc(line, size * sizeof line[0]);
            if (!line) {
                errno = ENOMEM;
                return 0;
            }

            *lineptr = line;
            *sizeptr = size;
        }

        wc = fgetwc(in);
        if (wc == WEOF) {
            line[used] = L'\0';
            errno = 0;
            return used;

        } else
        if (wc == L'\n') {
            line[used++] = L'\n';

            wc = fgetwc(in);
            if (wc == L'\r')
                line[used++] = L'\r';
            else
            if (wc != WEOF)
                ungetwc(wc, in);

            line[used] = L'\0';
            errno = 0;
            return used;

        } else
        if (wc == L'\r') {
            line[used++] = L'\r';

            wc = fgetwc(in);
            if (wc == L'\n')
                line[used++] = L'\n';
            else
            if (wc != WEOF)
                ungetwc(wc, in);

            line[used] = L'\0';
            errno = 0;
            return used;
        } else
        if (wc != L'\0')
            line[used++] = wc;
    }
}

/* Returns a dynamically allocated wide string,
   with contents from a multibyte string. */
wchar_t *dup_mbstowcs(const char *src)
{
    if (src && *src) {
        wchar_t *dst;
        size_t   len, check;

        len = mbstowcs(NULL, src, 0);
        if (len == (size_t)-1) {
            errno = EILSEQ;
            return NULL;
        }

        dst = malloc((len + 1) * sizeof *dst);
        if (!dst) {
            errno = ENOMEM;
            return NULL;
        }

        check = mbstowcs(dst, src, len + 1);
        if (check != len) {
            free(dst);
            errno = EILSEQ;
            return NULL;
        }

        /* Be paranoid, and ensure the string is terminated. */
        dst[len] = L'\0';
        return dst;

    } else {
        wchar_t *empty;

        empty = malloc(sizeof *empty);
        if (!empty) {
            errno = ENOMEM;
            return NULL;
        }

        *empty = L'\0';
        return empty;
    }
}

int main(int argc, char *argv[])
{
    wchar_t **argw;
    wchar_t  *line = NULL;
    size_t    size = 0;
    size_t    len;
    int       arg;

    if (!setlocale(LC_ALL, "")) {
        fprintf(stderr, "Current locale is unsupported.\n");
        return EXIT_FAILURE;
    }

    if (fwide(stdin, 1) <= 0) {
        fprintf(stderr, "Standard input does not support wide characters.\n");
        return EXIT_FAILURE;
    }

    if (fwide(stdout, 1) <= 0) {
        fprintf(stderr, "Standard output does not support wide characters.\n");
        return EXIT_FAILURE;
    }

    if (argc < 2) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s WIDE-CHARACTER [ WIDE-CHARACTER ... ]\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "This program will look for the first instance of each wide character\n");
        fprintf(stderr, "in each line of input.\n");
        return EXIT_SUCCESS;
    }

    /* Convert command-line arguments to wide character strings. */
    argw = malloc((size_t)(argc + 1) * sizeof *argw);
    if (!argw) {
        fprintf(stderr, "Out of memory.\n");
        return EXIT_FAILURE;
    }
    for (arg = 0; arg < argc; arg++) {
        argw[arg] = dup_mbstowcs(argv[arg]);
        if (!argw[arg]) {
            fprintf(stderr, "Error converting argv[%d]: %s.\n", arg, strerror(errno));
            return EXIT_FAILURE;
        }
    }
    argw[argc] = NULL;

    while (1) {

        len = wide_line(&line, &size, stdin);
        if (!len) {
            if (errno) {
                fprintf(stderr, "Error reading standard input: %s.\n", strerror(errno));
                return EXIT_FAILURE;
            } else
            if (ferror(stdin)) {
                fprintf(stderr, "Error reading standard input.\n");
                return EXIT_FAILURE;
            }
            /* It was just an end of file, no error. */
            break;
        }

        for (arg = 1; arg < argc; arg++)
            if (argw[arg][0] != L'\0') {
                wchar_t  *pos = wcschr(line, argw[arg][0]);
                if (pos) {
                    size_t  i = (size_t)(pos - line);

                    fputws(line, stdout);
                    wprintf(L"%*lc\n", (int)(i + 1), argw[arg][0]);
                }
            }

    }

    /* Because we are exiting the program,
       we don't *need* to free the line buffer we used.
       However, this is completely safe,
       and this is the way you should free the buffer. */
    free(line);
    line = NULL;
    size = 0;

    return EXIT_SUCCESS;
}

Because POSIX has not standardized a wide-character version of getline(), we implement our own variant as wide_line(). It supports all four newline conventions, and returns a size_t: 0 (with errno set) if an error occurs.

Because of the universal newline support, wide_line() is not well suited for interactive input, as it tends to be one character "late". (For line-buffered input, as terminals tend to be, that means one full line late.)

I included the wide_line() implementation because it, or something very much like it, solves most of the problems when reading wide-character input files that were written on various systems.

The dup_mbstowcs() function is most useful when the command-line parameters are needed as wide-character strings. It simply does the conversion into a dynamically allocated buffer. Essentially, argw[] is the wide-character copy of the argv[] array.
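If wide strings ever have to be handed back to byte-oriented code, the opposite conversion can be written the same way. The following is a sketch of such a counterpart, not something from the original program: the name dup_wcstombs() is hypothetical, and the result is encoded according to the current locale.

```c
#include <errno.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical counterpart to dup_mbstowcs(): returns a dynamically
   allocated multibyte string with the contents of a wide string,
   encoded according to the current locale. Returns NULL with errno
   set if a character cannot be represented or memory runs out. */
char *dup_wcstombs(const wchar_t *src)
{
    char   *dst;
    size_t  len, check;

    if (!src)
        src = L"";

    /* First pass: only compute the number of bytes needed. */
    len = wcstombs(NULL, src, 0);
    if (len == (size_t)-1) {
        errno = EILSEQ;
        return NULL;
    }

    dst = malloc(len + 1);
    if (!dst) {
        errno = ENOMEM;
        return NULL;
    }

    /* Second pass: do the actual conversion. */
    check = wcstombs(dst, src, len + 1);
    if (check != len) {
        free(dst);
        errno = EILSEQ;
        return NULL;
    }

    dst[len] = '\0';
    return dst;
}
```

As with dup_mbstowcs(), the caller owns the returned buffer and should free() it when done.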

Other than those two functions and the code that creates the argw[] array, there is not much code at all. (Feel free to poach the functions, or the entire code, for use in your own projects later on; I consider the code to be in the Public Domain.)

If you save the above as example.c, you can compile it using e.g.

gcc -Wall -O2 example.c -o example

If you run e.g.

printf 'Sergio è un Italiano e andò via!\n' | ./example 'o' 'ò' 'è'

the output will be

Sergio è un Italiano e andò via!
     o
Sergio è un Italiano e andò via!
                          ò
Sergio è un Italiano e andò via!
       è

The indentation "trick" is that if i is the position at which you want the wide character printed, then (i+1) is the width of that logical field. When we use * as the width field in the print specification, the width is read from an int parameter preceding the actual parameter being printed.
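The width trick can be tried in isolation with swprintf(), which formats into a wide-character buffer instead of a stream. This is only a sketch; the helper name marker_line() is hypothetical.

```c
#include <wchar.h>

/* Hypothetical helper: writes a marker line into dst that places the
   wide character wc at position i (0-based), padded with spaces on
   the left. The field width (i + 1) is passed as an int argument
   for the '*' in the format. size is the capacity of dst in wide
   characters. */
int marker_line(wchar_t *dst, size_t size, size_t i, wchar_t wc)
{
    return swprintf(dst, size, L"%*lc", (int)(i + 1), (wint_t)wc);
}
```

For i = 5 and wc = L'o' the buffer holds five spaces followed by o, which matches the first marker line in the output shown above. The marker lines up with the text only as long as every character before it occupies one terminal column.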

You need to convert to and from the expected character encodings. Say the old system expects some Windows code page, and the new code expects UTF-8. Then to call old functions from the new code you need to:

  1. Check that the conversion can be performed safely (the input may contain characters that cannot be represented in the desired Windows code page).
  2. Convert from UTF-8 to the desired Windows code page representation. This should yield a new buffer/string in the compatible representation (a copy).
  3. Call the old code with the newly converted representation of the original argument.
  4. Receive the output in some buffer; it will be in the Windows code page representation.
  5. Convert that output into a UTF-8 copy.
  6. Clean up the temporary copy of the input and the original output buffer from the old code.
  7. Return the converted UTF-8 output copy to the new code.

And you'd need to do the reverse dance if you want to call the new UTF-8 code from the old code.
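The conversion steps in the list above can be sketched with the POSIX iconv() interface. This is a minimal illustration under stated assumptions, not a definitive implementation: the helper name convert_encoding() is hypothetical, the encoding names are whatever your system's iconv supports, and only stateless encodings (single-byte code pages, UTF-8) are considered.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: converts the NUL-terminated string src from the
   encoding named `from` to the encoding named `to`, and returns a newly
   allocated copy (free() it when done). Returns NULL with errno set if
   the conversion is unsupported, or if src contains a character that
   cannot be represented exactly in the target encoding (step 1). */
char *convert_encoding(const char *to, const char *from, const char *src)
{
    iconv_t  cd;
    size_t   in_left  = strlen(src);
    size_t   out_size = 4 * in_left + 1;  /* assumed generous for code pages and UTF-8 */
    size_t   out_left = out_size - 1;
    char    *in = (char *)src;            /* iconv() does not modify the input bytes */
    char    *dst, *out;
    size_t   rc;
    int      saved;

    cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return NULL;                      /* errno was set by iconv_open() */

    dst = out = malloc(out_size);
    if (!dst) {
        iconv_close(cd);
        errno = ENOMEM;
        return NULL;
    }

    rc = iconv(cd, &in, &in_left, &out, &out_left);
    if (rc != 0) {
        /* (size_t)-1 is an error; any other nonzero value means some
           characters were converted in a non-reversible way. */
        saved = (rc == (size_t)-1) ? errno : EILSEQ;
        free(dst);
        iconv_close(cd);
        errno = saved;
        return NULL;
    }

    *out = '\0';
    iconv_close(cd);
    return dst;
}
```

A wrapper around an old function would call convert_encoding(old_encoding, "UTF-8", input) before the call and convert_encoding("UTF-8", old_encoding, output) afterwards, freeing the temporary copies in between. Here old_encoding stands for whichever code page the old code really expects; it is deliberately left as a placeholder.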

EDIT: Note that your old system cannot have been expecting pure ASCII, because ASCII is a 7-bit encoding, and UTF-8 is explicitly backwards compatible with it. So your first task is to correct your understanding of which encoding is actually being used.
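A quick way to confirm that a string is not plain ASCII is to count the bytes above 127. The helper below is a hypothetical sketch, not part of the answer. In UTF-8, ò is the two bytes 0xC3 0xB2, while in the DOS code page 437 (taken here only as an example of an old encoding) it is the single byte 0x95.

```c
#include <stddef.h>

/* Hypothetical helper: returns the number of bytes in s that fall
   outside the 7-bit ASCII range (values 128..255). */
size_t count_non_ascii(const char *s)
{
    size_t n = 0;

    for (; *s; s++)
        if ((unsigned char)*s > 127)
            n++;

    return n;
}
```

A result of zero means the string is pure ASCII and is therefore already valid UTF-8; any other result means the bytes have to be interpreted in some specific encoding before they can be converted.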
