Linux C read file UNICODE formatted text (notepad Windows)
Is there a way to read a text file, under Linux with C, that was saved on Windows as "UNICODE" with Notepad? In Linux, the text looks like this in the nano editor:
��T^@e^@s^@t^@
^@
but the vi editor reads it properly as:
Test
I should point out that the text consists of ordinary ANSI strings (no Unicode-only characters or foreign-language text). I tried the following, but with no result:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main() {
    char *loc = setlocale(LC_ALL, 0);
    setlocale(LC_ALL, loc);
    FILE * f = fopen("unicode.txt", "r");
    wint_t c;
    while((c = fgetwc(f)) != WEOF) {
        wprintf(L"%lc\n", c);
    }
    return 0;
}
UPDATE: I forgot to mention that the file format is "Little-endian UTF-16 Unicode text", i.e. UTF-16LE.
Include <wchar.h>, set a UTF-8 locale (setlocale(LC_ALL, "en_US.UTF-8") is fine), and open the file or stream in byte-oriented mode (handle = fopen(filename, "rb"), then fwide(handle, -1), i.e. not in wide mode). Then you can use
wint_t getwc_utf16le(FILE *const in)
{
    int lo, hi, code;

    /* First 16-bit unit, low byte first. */
    if ((lo = getc(in)) == EOF)
        return WEOF;
    if ((hi = getc(in)) == EOF)
        return lo; /* Or abort; input sequence ends prematurely */

    code = lo + 256 * hi;
    if (code < 0xD800 || code > 0xDBFF)
        return code; /* Not a high surrogate: already a complete code point.
                      * (A lone 0xDC00..0xDFFF here means the input is not valid UTF16-LE.) */

    /* High surrogate: read the second 16-bit unit of the pair. */
    if ((lo = getc(in)) == EOF)
        return code; /* Or abort; input sequence ends prematurely */
    if ((hi = getc(in)) == EOF) {
        ungetc(lo, in);
        return code; /* Or abort; input sequence ends prematurely */
    }

    /* Note: if ((lo + 256*hi) < 0xDC00 || (lo + 256*hi) > 0xDFFF)
     *       the input sequence is not valid UTF16-LE. */
    return 0x10000 + ((code & 0x3FF) << 10) + ((lo + 256 * hi) & 0x3FF);
}
to read code points from such an input file, assuming it contains UTF16-LE data.
The above function is more permissive than strictly necessary, but it does parse all the UTF16-LE I could throw at it (including the sometimes problematic U+100000..U+10FFFF code points), so if the input is correct, this function should handle it just fine.
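The surrogate-pair arithmetic in the last line of that function can be isolated and made strict. The following is only a minimal sketch: `utf16_combine` is a hypothetical helper name, not part of the answer above, and it rejects anything that is not a high surrogate followed by a low surrogate instead of passing it through.

```c
/* Combine a UTF-16 surrogate pair into a single code point.
 * Hypothetical strict helper: returns -1 unless hi is a high surrogate
 * (0xD800..0xDBFF) and lo is a low surrogate (0xDC00..0xDFFF). */
long utf16_combine(unsigned int hi, unsigned int lo)
{
    if (hi < 0xD800 || hi > 0xDBFF)
        return -1;
    if (lo < 0xDC00 || lo > 0xDFFF)
        return -1;
    /* Each surrogate carries 10 bits; the pair encodes (code point - 0x10000). */
    return 0x10000L + ((long)(hi & 0x3FF) << 10) + (long)(lo & 0x3FF);
}
```

For example, the pair 0xD83D, 0xDE00 combines to U+1F600, and 0xDBFF, 0xDFFF gives the highest code point, U+10FFFF.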
Because the locale is set to UTF-8 in Linux, and Linux implementations support the full Unicode set, the code points match the ones produced by the above function, and you can safely use wide character functions (from <wchar.h>) to handle the input.
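To illustrate handing the decoded code points to the <wchar.h> functions, here is a rough sketch that decodes an in-memory UTF16-LE buffer into a wchar_t string. The name `utf16le_to_wcs` is made up for this example, and it assumes wchar_t is at least 32 bits wide (as it is on Linux with glibc). Like the function above it is permissive: unpaired surrogates are copied through unchanged and an odd trailing byte is ignored.

```c
#include <stddef.h>
#include <wchar.h>

/* Decode UTF16-LE bytes into a NUL-terminated wchar_t string.
 * Hypothetical helper; assumes wchar_t can hold any Unicode code point.
 * Returns the number of wide characters stored, not counting the terminator. */
size_t utf16le_to_wcs(const unsigned char *src, size_t len,
                      wchar_t *dst, size_t max)
{
    size_t i = 0, n = 0;

    if (!src || !dst || max == 0)
        return 0;

    /* Skip a leading byte-order mark (bytes FF FE). */
    if (len >= 2 && src[0] == 0xFF && src[1] == 0xFE)
        i = 2;

    while (i + 1 < len && n + 1 < max) {
        unsigned long unit = src[i] | ((unsigned long)src[i + 1] << 8);
        i += 2;

        /* A high surrogate followed by a low surrogate forms one code point. */
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < len) {
            unsigned long low = src[i] | ((unsigned long)src[i + 1] << 8);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                unit = 0x10000 + ((unit & 0x3FF) << 10) + (low & 0x3FF);
                i += 2;
            }
        }
        dst[n++] = (wchar_t)unit;
    }
    dst[n] = L'\0';
    return n;
}
```

Once the data is in a wchar_t array, functions such as wcslen(), wcscmp(), or (after setting a UTF-8 locale) wprintf() with %ls can be applied to it.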
Often the first character in the file is the BOM ("byte-order mark"), 0xFEFF; in a UTF16-LE file it is stored as the two bytes 0xFF 0xFE. You can ignore it if it is the first character in the file. Elsewhere it is the zero-width non-breaking space. In my experience, those two bytes at the start of a file that is supposed to be text are quite a reliable indicator that the file is UTF16-LE. (So, you could peek at the first two bytes, and if they match those, assume it is UTF16-LE.)
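The "peek at the first two bytes" idea could look roughly like this. The sketch works on a buffer that already holds the first few bytes of the file. The names `bom_kind` and `detect_bom` are invented for this example, and the result is only a guess: a UTF-32LE file also starts with FF FE (followed by 00 00), so treat a UTF16-LE answer as a heuristic.

```c
#include <stddef.h>

/* Encodings this sketch can recognise from a leading byte-order mark. */
enum bom_kind { BOM_NONE = 0, BOM_UTF8, BOM_UTF16LE, BOM_UTF16BE };

/* Inspect the first bytes of a file (hypothetical helper).
 * buf holds the first len bytes read from the file. */
enum bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    if (!buf)
        return BOM_NONE;
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16LE;   /* U+FEFF stored low byte first */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16BE;   /* U+FEFF stored high byte first */
    return BOM_NONE;
}
```

A caller would read the first three bytes, call detect_bom(), and then either rewind the file or skip past the mark before decoding.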
Remember that wide-character end-of-file is WEOF, not EOF.
Hope this helps.
Edited 20150505: Here is a helper function one could use instead to read inputs (using the low-level unistd.h interface), converting them to UTF-8. read_utf8.h:
#ifndef READ_UTF8_H
#define READ_UTF8_H
#include <stddef.h> /* size_t */

/* Read input from file descriptor fd,
 * convert it to UTF-8 (using "UTF8//TRANSLIT" iconv conversion),
 * and append it to the specified buffer.
 * (*dataptr) points to a dynamically allocated buffer (may reallocate),
 * (*sizeptr) points to the size allocated for that buffer,
 * (*usedptr) points to the amount of data already in the buffer.
 * You may initialize the values to NULL,0,0, in which case they will
 * be dynamically allocated as needed.
 * Returns 0 on success, or a nonzero errno code (also stored in errno) on failure.
 */
int read_utf8(char **dataptr, size_t *sizeptr, size_t *usedptr, const int fd, const char *const charset);

#endif /* READ_UTF8_H */
read_utf8.c:
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <iconv.h>
#include <string.h>
#include <errno.h>
#define INPUT_CHUNK 16384
#define OUTPUT_CHUNK 8192
int read_utf8(char **dataptr, size_t *sizeptr, size_t *usedptr, const int fd, const char *const charset)
{
    char *data;
    size_t size;
    size_t used;
    char *input_data;
    size_t input_size, input_head, input_tail;
    int input_more;
    iconv_t conversion = (iconv_t)-1;

    if (!dataptr || !sizeptr || !usedptr || fd == -1 || !charset || !*charset)
        return errno = EINVAL;

    if (*dataptr) {
        data = *dataptr;
        size = *sizeptr;
        used = *usedptr;
        if (used > size)
            return errno = EINVAL;
    } else {
        data = NULL;
        size = 0;
        used = 0;
    }

    conversion = iconv_open("UTF8//TRANSLIT", charset);
    if (conversion == (iconv_t)-1)
        return errno = ENOTSUP;

    input_size = INPUT_CHUNK;
    input_data = malloc(input_size);
    if (!input_data) {
        if (conversion != (iconv_t)-1)
            iconv_close(conversion);
        return errno = ENOMEM; /* nonzero return, like every other error path */
    }

    input_head = 0;
    input_tail = 0;
    input_more = 1;

    while (1) {

        /* Move any unconverted input to the start of the input buffer. */
        if (input_tail > input_head) {
            if (input_head > 0) {
                memmove(input_data, input_data + input_head, input_tail - input_head);
                input_tail -= input_head;
                input_head = 0;
            }
        } else {
            input_head = 0;
            input_tail = 0;
        }

        /* Read more input, if there is room and the input has not ended. */
        if (input_more && input_tail < input_size) {
            ssize_t n;

            do {
                n = read(fd, input_data + input_tail, input_size - input_tail);
            } while (n == (ssize_t)-1 && errno == EINTR);

            if (n > (ssize_t)0)
                input_tail += n;
            else if (n == (ssize_t)0)
                input_more = 0;
            else if (n != (ssize_t)-1) {
                free(input_data);
                iconv_close(conversion);
                return errno = EIO;
            } else {
                const int errcode = errno;
                free(input_data);
                iconv_close(conversion);
                return errno = errcode;
            }
        }

        /* Nothing left to convert? */
        if (input_head == 0 && input_tail == 0)
            break;

        /* Make sure at least OUTPUT_CHUNK bytes are available for output. */
        if (used + OUTPUT_CHUNK > size) {
            size = (used / (size_t)OUTPUT_CHUNK + (size_t)2) * (size_t)OUTPUT_CHUNK;
            data = realloc(data, size);
            if (!data) {
                free(input_data);
                iconv_close(conversion);
                return errno = ENOMEM;
            }
            *dataptr = data;
            *sizeptr = size;
        }

        /* Convert as much of the buffered input as possible. */
        {
            char *source_ptr = input_data + input_head;
            size_t source_len = input_tail - input_head;
            char *target_ptr = data + used;
            size_t target_len = size - used;
            size_t n;

            n = iconv(conversion, &source_ptr, &source_len, &target_ptr, &target_len);
            if (n == (size_t)-1 && errno == EILSEQ) {
                free(input_data);
                iconv_close(conversion);
                return errno = EILSEQ;
            }

            /* No progress at all: bail out rather than loop forever. */
            if (source_ptr == input_data + input_head && target_ptr == data + used) {
                free(input_data);
                iconv_close(conversion);
                return errno = EDEADLK;
            }

            input_head = (size_t)(source_ptr - input_data);
            used = (size_t)(target_ptr - data);
            *usedptr = used;
        }
    }

    free(input_data);
    iconv_close(conversion);

    /* Pad the end of the data with zero bytes. */
    if (used + 16 >= size) {
        size = (used | 15) + 17;
        data = realloc(data, size);
        if (!data)
            return errno = ENOMEM;
        *dataptr = data;
        *sizeptr = size;
        memset(data + used, 0, size - used);
    } else
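        if (used + 32 < size)
            memset(data + used, 0, 32);          /* plenty of room: clear a 32-byte tail */
        else
            memset(data + used, 0, size - used); /* fewer than 32 bytes left: clear only what is allocated */

    return errno = 0;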
}
and an example program, example.c, showing how to use it:
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
#include "read_utf8.h"
int main(int argc, char *argv[])
{
    char *file_buffer = NULL;
    size_t file_allocd = 0;
    size_t file_length = 0;
    int fd;

    if (argc != 3 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, " %s FILENAME CHARSET\n", argv[0]);
        fprintf(stderr, " %s FILENAME CHARSET//IGNORE\n", argv[0]);
        fprintf(stderr, "\n");
        return EXIT_FAILURE;
    }

    do {
        fd = open(argv[1], O_RDONLY | O_NOCTTY);
    } while (fd == -1 && errno == EINTR);
    if (fd == -1) {
        fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
        return EXIT_FAILURE;
    }

    if (read_utf8(&file_buffer, &file_allocd, &file_length, fd, argv[2])) {
        if (errno == ENOTSUP)
            fprintf(stderr, "%s: Unsupported character set.\n", argv[2]);
        else
            fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
        return EXIT_FAILURE;
    }

    errno = EIO;
    if (close(fd)) {
        fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
        return EXIT_FAILURE;
    }

    fprintf(stderr, "%s: read %zu bytes, allocated %zu.\n", argv[1], file_length, file_allocd);

    if (file_length > 0)
        if (fwrite(file_buffer, file_length, 1, stdout) != 1) {
            fprintf(stderr, "Error writing to standard output.\n");
            return EXIT_FAILURE;
        }

    return EXIT_SUCCESS;
}
This lets you read (either into an empty, dynamically allocated buffer, or appending to an existing dynamically allocated buffer) using any character set supported by your system (use iconv --list to see the list), auto-converting the contents to UTF-8.
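When the UTF16-LE data is already in memory and small, the same iconv conversion can be done in one shot. This is a rough sketch only: `utf16le_to_utf8` is a hypothetical name, it assumes the system's iconv knows the "UTF-8" and "UTF-16LE" charset names (check with iconv --list), and it gives up if the output buffer is too small instead of growing it the way read_utf8() does.

```c
#include <iconv.h>
#include <stddef.h>

/* Convert srclen bytes of UTF16-LE in src to NUL-terminated UTF-8 in dst.
 * Hypothetical one-shot helper. Returns the number of UTF-8 bytes written
 * (not counting the terminator), or -1 on any error. */
long utf16le_to_utf8(const char *src, size_t srclen, char *dst, size_t dstsize)
{
    iconv_t cd;
    char *in = (char *)src; /* iconv() takes char **, but only reads the input */
    char *out = dst;
    size_t inleft = srclen;
    size_t outleft;
    size_t rc;

    if (!src || !dst || dstsize == 0)
        return -1;
    outleft = dstsize - 1; /* keep one byte for the terminating NUL */

    cd = iconv_open("UTF-8", "UTF-16LE"); /* arguments: to-charset, from-charset */
    if (cd == (iconv_t)-1)
        return -1;

    rc = iconv(cd, &in, &inleft, &out, &outleft);
    iconv_close(cd);

    /* Fail on invalid input, truncated input, or an output buffer that is too small. */
    if (rc == (size_t)-1 || inleft != 0)
        return -1;

    *out = '\0';
    return (long)(out - dst);
}
```

Reading a whole file of unknown size is where the chunked approach in read_utf8() pays off. This version only suits cases where an upper bound on the output size is known beforehand.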
It uses a temporary input buffer (of INPUT_CHUNK bytes) to read the file part by part, and reallocates the output buffer in multiples of OUTPUT_CHUNK bytes, keeping at least OUTPUT_CHUNK bytes available for each conversion. The constants may need a bit of tuning for different use cases; they're by no means optimal or even suggested values. Larger ones lead to faster code, especially for INPUT_CHUNK, as most filesystems perform better when reading large chunks (2097152 is the currently suggested size if I/O performance is important), but you should keep OUTPUT_CHUNK at a similar size, or perhaps twice that, to reduce the number of reallocations needed. (You can trim the resulting buffer afterwards to used+1 bytes, using realloc(), to avoid memory waste.)
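The trimming step mentioned in the parenthesis might be sketched as follows. `trim_buffer` is a hypothetical helper, not part of read_utf8.c. It takes the same buffer and size variables that were passed to read_utf8(), plus the used count it reported.

```c
#include <stdlib.h>

/* Shrink a dynamically allocated buffer to used+1 bytes, keeping a
 * terminating NUL after the data. Hypothetical helper.
 * Returns 0 on success, -1 on failure (the buffer is then left unchanged). */
int trim_buffer(char **dataptr, size_t *sizeptr, size_t used)
{
    char *trimmed;

    if (!dataptr || !*dataptr || !sizeptr || used > *sizeptr)
        return -1;

    trimmed = realloc(*dataptr, used + 1);
    if (!trimmed)
        return -1; /* the original allocation is still valid */

    trimmed[used] = '\0';
    *dataptr = trimmed;
    *sizeptr = used + 1;
    return 0;
}
```

In example.c this would be called as trim_buffer(&file_buffer, &file_allocd, file_length) after read_utf8() succeeds.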