Linux C read file UNICODE formatted text (notepad Windows)

Is there a way to read, in C under Linux, a text file that was saved on Windows with Notepad as "UNICODE"? In the nano editor on Linux, the text looks like this:

��T^@e^@s^@t^@
^@

but the vi editor displays it properly as:

Test

I should specify that the text consists of plain ANSI strings (no Unicode characters or foreign-language text). I tried this, but got no result:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main() {
   char *loc = setlocale(LC_ALL, 0);
   setlocale(LC_ALL, loc);
   FILE * f = fopen("unicode.txt", "r");
   wint_t c;

   while((c = fgetwc(f)) != WEOF) {
      wprintf(L"%lc\n", c);
   }
   return 0;
}

UPDATE:

I forgot to mention that the file format is little-endian UTF-16 Unicode text, that is, UTF-16LE.

Include <wchar.h>, set a UTF-8 locale (setlocale(LC_ALL, "en_US.UTF-8") is fine), and open the file or stream in byte-oriented mode (handle = fopen(filename, "rb") followed by fwide(handle, -1), i.e. not in wide mode). Then you can use

wint_t getwc_utf16le(FILE *const in)
{
    int lo, hi, code;

    if ((lo = getc(in)) == EOF)
        return WEOF;

    if ((hi = getc(in)) == EOF)
        return lo; /* Or abort; input sequence ends prematurely */

    code = lo + 256 * hi;
    if (code < 0xD800 || code > 0xDBFF)
        return code; /* Not a high surrogate: a complete code point
                      * (a lone low surrogate, 0xDC00..0xDFFF, is passed through as-is) */

    if ((lo = getc(in)) == EOF)
        return code; /* Or abort; input sequence ends prematurely */

    if ((hi = getc(in)) == EOF) {
        ungetc(lo, in);
        return code; /* Or abort; input sequence ends prematurely */
    }

    /* Note: if ((lo + 256*hi) < 0xDC00 || (lo + 256*hi) > 0xDFFF)
     *       the input sequence is not valid UTF16-LE. */
    return 0x10000 + ((code & 0x3FF) << 10) + ((lo + 256 * hi) & 0x3FF);
}

to read code points from such an input file, assuming it contains UTF-16LE data.
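As a rough illustration of how such a reader might be driven, here is a minimal sketch of a read loop that collects code points into an array and drops a leading byte-order mark. The names next_utf16le and read_utf16le_points are hypothetical, not part of the answer above; next_utf16le is a condensed stand-in for getwc_utf16le that, as a simplifying assumption, treats truncated input as end of input.

```c
#include <stdio.h>
#include <wchar.h>

/* Condensed stand-in for getwc_utf16le() (hypothetical name).
 * Assumption: truncated input is simply treated as end of input. */
static wint_t next_utf16le(FILE *in)
{
    int lo, hi;
    unsigned long code, low;

    if ((lo = getc(in)) == EOF || (hi = getc(in)) == EOF)
        return WEOF;
    code = (unsigned long)lo + 256UL * (unsigned long)hi;

    /* Anything that is not a high surrogate is a complete code point. */
    if (code < 0xD800UL || code > 0xDBFFUL)
        return (wint_t)code;

    /* High surrogate: read the second code unit and combine the pair. */
    if ((lo = getc(in)) == EOF || (hi = getc(in)) == EOF)
        return WEOF;
    low = (unsigned long)lo + 256UL * (unsigned long)hi;
    return (wint_t)(0x10000UL + ((code & 0x3FFUL) << 10) + (low & 0x3FFUL));
}

/* Hypothetical helper: read up to max code points from a byte-oriented
 * stream into out[], skipping a BOM (0xFEFF) if it is the first character.
 * Returns the number of code points stored. */
size_t read_utf16le_points(FILE *in, wint_t *out, size_t max)
{
    size_t n = 0;
    int    first = 1;
    wint_t c;

    /* Note the comparison against WEOF, not EOF. */
    while (n < max && (c = next_utf16le(in)) != WEOF) {
        if (first && c == 0xFEFF) {
            first = 0;
            continue;
        }
        first = 0;
        out[n++] = c;
    }
    return n;
}
```

A caller would open the file with fopen(filename, "rb"), pass the handle to read_utf16le_points(), and then print the collected values with wide-character functions once a UTF-8 locale is set.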

The above function is more permissive than strictly necessary, but it parses all the UTF-16LE I could throw at it (including the sometimes problematic U+100000..U+10FFFF code points), so if the input is correct, this function should handle it just fine.
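If stricter behaviour is wanted, the surrogate check mentioned in the function's comment can be made explicit. The following is a minimal sketch, not the author's code: utf16_combine_strict is a hypothetical helper that takes the two 16-bit code units (already assembled from their bytes) and rejects anything that is not a valid high/low surrogate pair.

```c
#include <stdint.h>

/* Hypothetical strict helper: combine a UTF-16 surrogate pair.
 * Returns the code point (0x10000..0x10FFFF), or -1 if 'high' is not a
 * high surrogate or 'low' is not a low surrogate. */
long utf16_combine_strict(uint16_t high, uint16_t low)
{
    if (high < 0xD800 || high > 0xDBFF)
        return -1;  /* first unit must be 0xD800..0xDBFF */
    if (low < 0xDC00 || low > 0xDFFF)
        return -1;  /* second unit must be 0xDC00..0xDFFF */

    /* Same arithmetic as the permissive reader: ten bits from each unit. */
    return 0x10000L + (((long)high & 0x3FF) << 10) + ((long)low & 0x3FF);
}
```

For example, the pair 0xD83D, 0xDE00 combines to U+1F600, while a high surrogate followed by an ordinary character such as 0x0041 is reported as invalid.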

Because the locale is set to UTF-8 in Linux, and Linux implementations support the full Unicode set, the code points match the ones produced by the above function, and you can safely use wide-character functions (from <wchar.h>) to handle the input.

Often the first character in the file is a BOM ("byte-order mark"), 0xFEFF. You can ignore it if it is the first character in the file. Elsewhere it is the zero-width non-breaking space. In my experience, those two bytes (0xFF 0xFE on disk, since the low byte comes first) at the start of a file that is supposed to be text are a quite reliable indicator that the file is UTF-16LE. (So you could peek at the first two bytes, and if they match, assume it is UTF-16LE.)
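The "peek at the first two bytes" idea could look roughly like this. It is a sketch only: the enum and the function name detect_utf16_bom are made up for illustration, and the check is a heuristic (the same two bytes also begin a UTF-32LE byte-order mark).

```c
#include <stddef.h>

/* Hypothetical result type for the BOM check. */
enum bom_kind { BOM_NONE = 0, BOM_UTF16LE, BOM_UTF16BE };

/* Inspect the first bytes of a buffer for a UTF-16 byte-order mark.
 * U+FEFF is stored as FF FE in little-endian and FE FF in big-endian.
 * Heuristic only: FF FE 00 00 would also match a UTF-32LE BOM. */
enum bom_kind detect_utf16_bom(const unsigned char *buf, size_t len)
{
    if (buf && len >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            return BOM_UTF16LE;
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return BOM_UTF16BE;
    }
    return BOM_NONE;
}
```

In practice one would read the first two bytes of the file into a small buffer, call the function, and rewind or skip past the mark depending on the result.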

Remember that the wide-character end-of-file is WEOF, not EOF.

Hope this helps.


Edited 20150505: Here is a helper function one could use instead to read input (using the low-level unistd.h interface), converting it to UTF-8. read_utf8.h:

#ifndef   READ_UTF8_H
#define   READ_UTF8_H

/* Read input from file descriptor fd,
 * convert it to UTF-8 (using "UTF8//TRANSLIT" iconv conversion),
 * and append it to the specified buffer.
 *    (*dataptr)   points to a dynamically allocated buffer (may reallocate),
 *    (*sizeptr)   points to the size allocated for that buffer,
 *    (*usedptr)   points to the amount of data already in the buffer.
 * You may initialize the values to NULL,0,0, in which case they will
 * be dynamically allocated as needed.
*/
int read_utf8(char **dataptr, size_t *sizeptr, size_t *usedptr, const int fd, const char *const charset);

#endif /* READ_UTF8_H */

read_utf8.c:

#define  _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <iconv.h>
#include <string.h>
#include <errno.h>

#define   INPUT_CHUNK  16384
#define   OUTPUT_CHUNK  8192

int read_utf8(char **dataptr, size_t *sizeptr, size_t *usedptr, const int fd, const char *const charset)
{
    char    *data;
    size_t   size;
    size_t   used;

    char    *input_data;
    size_t   input_size, input_head, input_tail;
    int      input_more;

    iconv_t  conversion = (iconv_t)-1;

    if (!dataptr || !sizeptr || !usedptr || fd == -1 || !charset || !*charset)
        return errno = EINVAL;

    if (*dataptr) {
        data = *dataptr;
        size = *sizeptr;
        used = *usedptr;
        if (used > size)
            return errno = EINVAL;
    } else {
        data = NULL;
        size = 0;
        used = 0;
    }

    conversion = iconv_open("UTF8//TRANSLIT", charset);
    if (conversion == (iconv_t)-1)
        return errno = ENOTSUP;

    input_size = INPUT_CHUNK;
    input_data = malloc(input_size);
    if (!input_data) {
        iconv_close(conversion);
        /* Return the error code; returning 0 here would signal success. */
        return errno = ENOMEM;
    }
    input_head = 0;
    input_tail = 0;
    input_more = 1;

    while (1) {

        if (input_tail > input_head) {
            if (input_head > 0) {
                memmove(input_data, input_data + input_head, input_tail - input_head);
                input_tail -= input_head;
                input_head  = 0;
            }
        } else {
            input_head = 0;
            input_tail = 0;
        }

        if (input_more && input_tail < input_size) {
            ssize_t n;

            do {
                n = read(fd, input_data + input_tail, input_size - input_tail);
            } while (n == (ssize_t)-1 && errno == EINTR);

            if (n > (ssize_t)0)
                input_tail += n;
            else
            if (n == (ssize_t)0)
                input_more = 0;
            else
            if (n != (ssize_t)-1) {
                free(input_data);
                iconv_close(conversion);
                return errno = EIO;
            } else {
                const int errcode = errno;
                free(input_data);
                iconv_close(conversion);
                return errno = errcode;
            }
        }

        if (input_head == 0 && input_tail == 0)
            break;

        if (used + OUTPUT_CHUNK > size) {
            size = (used / (size_t)OUTPUT_CHUNK + (size_t)2) * (size_t)OUTPUT_CHUNK;
            data = realloc(data, size);
            if (!data) {
                free(input_data);
                iconv_close(conversion);
                return errno = ENOMEM;
            }
            *dataptr = data;
            *sizeptr = size;
        }

        {
            char   *source_ptr = input_data + input_head;
            size_t  source_len = input_tail - input_head;

            char   *target_ptr = data + used;
            size_t  target_len = size - used;

            size_t  n;

            n = iconv(conversion, &source_ptr, &source_len, &target_ptr, &target_len);
            if (n == (size_t)-1 && errno == EILSEQ) {
                free(input_data);
                iconv_close(conversion);
                return errno = EILSEQ;
            }

            if (source_ptr == input_data + input_head && target_ptr == data + used) {
                free(input_data);
                iconv_close(conversion);
                return errno = EDEADLK;
            }

            input_head = (size_t)(source_ptr - input_data);
            used = (size_t)(target_ptr - data);

            *usedptr = used;
        }
    }

    free(input_data);
    iconv_close(conversion);

    if (used + 16 >= size) {
        size = (used | 15) + 17;
        data = realloc(data, size);
        if (!data)
            return errno = ENOMEM;
        *dataptr = data;
        *sizeptr = size;
        memset(data + used, 0, size - used);
    } else
    if (used + 32 < size)
        memset(data + used, 0, size - used);
    else
        memset(data + used, 0, 32);

    return errno = 0;
}

and an example program, example.c, showing how to use it:

#define  _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
#include "read_utf8.h"

int main(int argc, char *argv[])
{
    char   *file_buffer = NULL;
    size_t  file_allocd = 0;
    size_t  file_length = 0;
    int     fd;

    if (argc != 3 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, "       %s FILENAME CHARSET\n", argv[0]);
        fprintf(stderr, "       %s FILENAME CHARSET//IGNORE\n", argv[0]);
        fprintf(stderr, "\n");
        return EXIT_FAILURE;
    }

    do {
        fd = open(argv[1], O_RDONLY | O_NOCTTY);
    } while (fd == -1 && errno == EINTR);
    if (fd == -1) {
        fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
        return EXIT_FAILURE;
    }

    if (read_utf8(&file_buffer, &file_allocd, &file_length, fd, argv[2])) {
        if (errno == ENOTSUP)
            fprintf(stderr, "%s: Unsupported character set.\n", argv[2]);
        else
            fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
        return EXIT_FAILURE;
    }

    errno = EIO;
    if (close(fd)) {
        fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
        return EXIT_FAILURE;
    }

    fprintf(stderr, "%s: read %zu bytes, allocated %zu.\n", argv[1], file_length, file_allocd);
    if (file_length > 0)
        if (fwrite(file_buffer, file_length, 1, stdout) != 1) {
            fprintf(stderr, "Error writing to standard output.\n");
            return EXIT_FAILURE;
        }

    return EXIT_SUCCESS;
}

This lets you read (either into an empty, dynamically allocated buffer, or appending to an existing dynamically allocated buffer) using any character set supported by your system (use iconv --list to see the list), auto-converting the contents to UTF-8.
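When the UTF-16LE data is already in memory, the same iconv conversion can be sketched without the file-descriptor loop. The helper below is an illustration under stated assumptions, not part of read_utf8.c: the name utf16le_to_utf8 is hypothetical, the whole input is converted in one iconv() call, and any error (including a too-small output buffer or a truncated final character) is reported as failure. It also assumes the system's iconv supports the "UTF-16LE" and "UTF-8" names.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert srclen bytes of UTF-16LE in src to a
 * NUL-terminated UTF-8 string in dst (capacity dstsize bytes).
 * Returns the number of UTF-8 bytes written, not counting the NUL,
 * or (size_t)-1 on any failure. */
size_t utf16le_to_utf8(const char *src, size_t srclen, char *dst, size_t dstsize)
{
    iconv_t cd;
    char   *in = (char *)src;   /* iconv() takes char **, but does not modify the input */
    char   *out = dst;
    size_t  inleft = srclen;
    size_t  outleft;
    size_t  rc;

    if (!src || !dst || dstsize == 0)
        return (size_t)-1;
    outleft = dstsize - 1;      /* reserve one byte for the terminating NUL */

    cd = iconv_open("UTF-8", "UTF-16LE");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &in, &inleft, &out, &outleft);
    iconv_close(cd);

    /* Treat invalid, incomplete, or oversized input as failure. */
    if (rc == (size_t)-1 || inleft != 0)
        return (size_t)-1;

    *out = '\0';
    return (size_t)(out - dst);
}
```

Because the source charset is given explicitly as UTF-16LE, a leading byte-order mark is not interpreted specially, so the caller may want to skip the first two bytes when they are FF FE.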

It uses a temporary input buffer (of INPUT_CHUNK bytes) to read the file piece by piece, and reallocates the output buffer in multiples of OUTPUT_CHUNK bytes, keeping at least OUTPUT_CHUNK bytes available for each conversion. The constants may need some tuning for different use cases; they are by no means optimal or even suggested values. Larger values lead to faster code, especially for INPUT_CHUNK, since most filesystems perform better when reading large chunks (2097152 is the currently suggested size if I/O performance is important), but OUTPUT_CHUNK should then be of similar size, or perhaps twice that, to reduce the number of reallocations needed. (You can trim the resulting buffer afterwards to used+1 bytes using realloc() to avoid wasting memory.)
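The trimming step mentioned in parentheses might be sketched as follows. trim_buffer is a hypothetical helper, not part of read_utf8.c; it assumes the buffer holds at least used+1 bytes, which read_utf8() guarantees through its zero padding.

```c
#include <stdlib.h>

/* Hypothetical helper: shrink a buffer filled by read_utf8() to used+1
 * bytes, keeping a terminating NUL after the data.
 * Returns 0 on success, -1 on bad arguments or allocation failure
 * (in which case the original buffer is left untouched and still valid). */
int trim_buffer(char **dataptr, size_t *sizeptr, size_t used)
{
    char *trimmed;

    if (!dataptr || !*dataptr || !sizeptr || used >= *sizeptr)
        return -1;

    trimmed = realloc(*dataptr, used + 1);
    if (!trimmed)
        return -1;

    trimmed[used] = '\0';
    *dataptr = trimmed;
    *sizeptr = used + 1;
    return 0;
}
```

In example.c this would be called as trim_buffer(&file_buffer, &file_allocd, file_length) after read_utf8() returns successfully.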
