
Copying the contents of a Unicode file into a char array in C

I wrote the C code below, which copies a file. It works correctly for Unicode files (exe and rar files, for example). I use an array of char to hold each block of the file. As far as I know, the char data type can store only one byte, as in the extended ASCII standard.

The fread() call uses buffer[buflen], a char array, to hold one block (100 bytes) of an exe file, and the contents of buffer are then written to another file. How is it possible to store a block of Unicode characters in char? Why does this code work for Unicode files without any problem?

The copyFile function:

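#include <stdio.h>

void copyFile(const char *src, const char *dst)
{
    enum { buflen = 100 };
    char buffer[buflen];
    long fileSize, curFileSize, offset = 0;
    FILE *r, *w;

    r = fopen(src, "rb");
    w = fopen(dst, "wb");
    if (r == NULL || w == NULL)
    {
        /* Could not open one of the files: close whichever did open. */
        if (r != NULL) fclose(r);
        if (w != NULL) fclose(w);
        return;
    }

    /* Determine the file size by seeking to the end. */
    fseek(r, 0, SEEK_END);
    fileSize = ftell(r);
    fseek(r, 0, SEEK_SET);

    /* Copy whole blocks of buflen bytes. */
    while (fileSize - (curFileSize = ftell(r)) >= buflen)
    {
        fseek(r, offset * buflen, SEEK_SET);
        fread(buffer, sizeof(buffer), 1, r);
        fwrite(buffer, sizeof(buffer), 1, w);
        offset++;
    }

    /* Copy the remaining tail (fewer than buflen bytes), if any. */
    if ((fileSize - curFileSize) != 0)
    {
        fseek(r, curFileSize, SEEK_SET);
        fread(buffer, (size_t)(fileSize - curFileSize), 1, r);
        fwrite(buffer, (size_t)(fileSize - curFileSize), 1, w);
    }

    fclose(w);
    fclose(r);
}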

The entry point:

int main()
{
    copyFile("e:/1.exe", "e:/2.exe");
    return 0;
}

What is the reason for using the char data type, or a struct containing char, in the fread and fwrite functions?

Thanks to everybody for helping me.

Any file, regardless of encoding, is just a sequence of bytes. The char type can store any byte, so you're just copying the file byte for byte. (char is used in C and C++ as both a character type and a numeric type capable of holding a byte. This can be confusing, but both usages are valid.)

fread and fwrite are specified in terms of char because they read and write bytes.
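To make the "byte for byte" point concrete, here is a minimal sketch of a copy loop that relies on the count returned by fread, so the short final block needs no separate handling. The name copy_bytes and the 4096-byte buffer size are illustrative choices, not anything from the question.

```c
#include <stdio.h>

/* Hypothetical helper: copies src to dst byte for byte.
   Returns 0 on success, -1 on any I/O error. */
int copy_bytes(const char *src, const char *dst)
{
    unsigned char buffer[4096];   /* buffer size is an arbitrary choice */
    size_t n;
    int status = 0;
    FILE *in = fopen(src, "rb");
    FILE *out;

    if (in == NULL)
        return -1;
    out = fopen(dst, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    /* fread returns how many bytes it actually read, so the last,
       shorter block is written with exactly that count. */
    while ((n = fread(buffer, 1, sizeof buffer, in)) > 0) {
        if (fwrite(buffer, 1, n, out) != n) {
            status = -1;
            break;
        }
    }
    if (ferror(in))
        status = -1;

    fclose(in);
    if (fclose(out) != 0)
        status = -1;
    return status;
}
```

Calling copy_bytes("e:/1.exe", "e:/2.exe") would behave like the question's copyFile. The bytes are never interpreted as characters, so the encoding of the file plays no role.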

Well, the file you are reading will probably be encoded in UTF-8, which makes the characters in the range U+0000 to U+007F identical to their ASCII counterparts (this allows the text to be read normally even without a Unicode-aware reader). Characters from the ISO Latin sets normally map to two-byte sequences, and other characters map to sequences of three bytes or more. As long as you don't modify the data you are reading, the kind of data stored (be it binary or textual) and the encoding used don't matter: the copy will be exactly equal to the original. If it is not, you will have to look into your code, because it is changing the data and making the copy differ from the original.

Normally you will not have any problem, as long as you don't break any of these sequences (that is, bytes that arrived together in the file get written separately, to different places, in the copy). This does not normally happen in a file copy. Determining where a UTF-8 or UTF-16 character begins is relatively easy, because character boundaries in these encodings can be identified whether you move forward or backward through the data stream.

In UTF-8, a multibyte character consists of a lead byte, which encodes the number of bytes in the character, followed by a tail of n-1 continuation bytes (again, easily detectable). The lead byte is 0b110xxxxx for a two-byte character (0b marks an octet written in binary from now on), 0b1110xxxx for a three-byte character, and so on. The original design went up to 0b1111110x for a six-byte character, but UTF-8 has since been restricted to at most four bytes (lead byte 0b11110xxx). The bytes that follow the lead byte are encoded as 0b10xxxxxx. Going forward, once you see a byte with the MSB set, you know you are at a multibyte sequence: count the number of leading ones before the first 0 and you have the number of bytes that make up the character. Going backward, you first encounter 0b10xxxxxx bytes, and you keep going back until you reach a 0b11xxxxxx byte, which is the first byte of the sequence. Then you apply the forward procedure again.
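The forward and backward procedures for UTF-8 can be sketched as follows. The function names are hypothetical, and the sketch assumes the four-byte form of UTF-8. It only classifies bytes by their leading bits and does not validate sequences (overlong forms, for example, are not rejected).

```c
#include <stddef.h>

/* Forward: length in bytes of the sequence introduced by lead byte b.
   Returns 0 if b is a continuation byte (10xxxxxx) or has a bit
   pattern that is not a lead byte in four-byte UTF-8. */
int utf8_seq_len(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;  /* 0xxxxxxx: ASCII */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Backward: index of the first byte of the character that contains
   position pos, found by skipping continuation bytes (10xxxxxx). */
size_t utf8_char_start(const unsigned char *s, size_t pos)
{
    while (pos > 0 && (s[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}
```

For the three bytes 0xE2 0x82 0xAC (the euro sign, U+20AC), utf8_seq_len(0xE2) gives 3, and utf8_char_start called on either continuation byte returns the index of the 0xE2 byte.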

In UTF-16 the procedure is almost the same. Characters below 0x10000 are encoded as a single 16-bit number, and characters at or above 0x10000 are encoded as a surrogate pair of 16-bit numbers with the following pattern: 0b110110xxxxxxxxxx for the first 16 bits of the pair and 0b110111xxxxxxxxxx for the second. This time you have to subtract 0x10000 from the character's code point to get the x bits that go into the two 16-bit quantities, but the procedure is similar to the one used in UTF-8.
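As a small worked sketch of the surrogate-pair rule, the hypothetical function below subtracts 0x10000 and splits the remaining 20 bits into two groups of 10. The name utf16_encode and its return convention are illustrative assumptions.

```c
#include <stdint.h>

/* Encodes code point cp as UTF-16 code units in out[0..1].
   Returns the number of units written (1 or 2), or 0 if cp
   cannot be encoded. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                      /* range reserved for surrogates */
    if (cp > 0x10FFFF)
        return 0;                      /* outside the Unicode code space */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;         /* single 16-bit unit */
        return 1;
    }
    cp -= 0x10000;                                 /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));      /* 110110xxxxxxxxxx */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));    /* 110111xxxxxxxxxx */
    return 2;
}
```

For U+1F600, subtracting 0x10000 leaves 0xF600. Its upper ten bits are 0x03D and its lower ten bits are 0x200, so the pair is 0xD83D 0xDE00.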

In UTF-32, every character is stored and transmitted as a single 32-bit quantity, so there are no multi-unit sequences. At the time of writing, the standard is at version 8.0, and the Unicode code space holds 1,114,112 code points (U+0000 to U+10FFFF).

When another UTF encoding is used, for example UTF-16, characters are stored as 16-bit quantities, and byte order starts to matter: if the data was written on a little-endian architecture and is interpreted on a big-endian one, the two bytes of each unit must be swapped to preserve their values on the target architecture. There is a way to cope with this: a special BOM (byte order mark) signature lets a reader check which endianness is used in the data. But as long as you copy a file byte by byte, nothing is reordered and the copy is exactly the same as the original, so the encoding is of no concern.
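A reader that does interpret the data could check for a BOM roughly like this. It is a sketch with hypothetical names that covers only the UTF-8 and UTF-16 signatures. Note that the UTF-32 little-endian BOM (FF FE 00 00) starts with the same two bytes as the UTF-16 little-endian one, so a real detector would need to look further.

```c
#include <stddef.h>

enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE };

/* Inspects the first bytes of a buffer for a byte order mark. */
enum bom_kind detect_bom(const unsigned char *p, size_t n)
{
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return BOM_UTF8;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return BOM_UTF16_LE;   /* U+FEFF stored low byte first */
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return BOM_UTF16_BE;   /* U+FEFF stored high byte first */
    return BOM_NONE;
}

/* Swaps the two bytes of a 16-bit unit, as needed when the data's
   byte order differs from the machine's. */
unsigned swap16(unsigned v)
{
    return ((v & 0xFFu) << 8) | ((v >> 8) & 0xFFu);
}
```

A plain file copy needs none of this, since it never looks inside the bytes.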

In variable-length encodings (utf-5, utf-7, utf-8 and utf-16), the problem arises if you break one of the multi-unit sequences that map to a single code point: the decoder can no longer recognize the character (it becomes an illegal sequence), and you normally get a special character in the output signalling that invalid input was detected. In fixed-length encodings (utf-32), you get a broken character only if you split the file at a point that is not a multiple of 32 bits.
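If a program does need to decode text block by block (unlike a plain copy), it has to notice when a block ends in the middle of a sequence. The hypothetical helper below, assuming four-byte UTF-8, reports how many trailing bytes belong to a sequence cut off by the end of the buffer, so they can be carried over to the next block.

```c
#include <stddef.h>

/* Returns how many bytes at the end of buf[0..n) belong to a UTF-8
   sequence that is cut off by the end of the buffer, or 0 if the
   buffer ends on a character boundary (or the tail is malformed). */
size_t utf8_incomplete_tail(const unsigned char *buf, size_t n)
{
    size_t i = n;
    size_t have, need;
    unsigned char lead;

    /* Step back over at most 3 continuation bytes (10xxxxxx). */
    while (i > 0 && n - i < 3 && (buf[i - 1] & 0xC0) == 0x80)
        i--;
    if (i == 0)
        return 0;              /* empty, or continuation bytes only */

    lead = buf[i - 1];
    have = n - i + 1;          /* lead byte plus continuations seen */

    if ((lead & 0xE0) == 0xC0)      need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else                            need = 1;  /* ASCII or not a lead byte */

    return (need > have) ? have : 0;
}
```

With the 100-byte blocks from the question, a three-byte character may well straddle two blocks, but because both halves are written back in order, the copy is unaffected.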

UTF was designed as an efficient way to store and send a practically unbounded set of characters. To achieve this, it maps (or tries to map) the most common characters to a single byte and increases the length as more specific or rarer characters are selected.

The main source of information about UNICODE is the UNICODE FORUM, where you will find specifications, guidelines and even character maps for the full UNICODE range. The UTF-8, UTF-16 and UTF-32 encodings are described there. For utf-5 and utf-7, you have to follow the links above.
