Reading binary files in C++

I would like to ask for help. I am just starting with C++ and got this homework at school: we have to write a function bool UTF8toUTF16 (const char * src, const char * dst ); which is supposed to read the src file encoded in UTF-8 and write its contents to the dst file in UTF-16. We also must not use any libraries other than the ones included in my code below.

So the first thing I did was create a file "xx.txt" and, in classic Windows Notepad, write a single character into it, for example 'š'. Now I am trying to write a program that reads this file in binary mode, byte by byte (or a few bytes at a time), and prints each value, but my program doesn't work like that.

So I have this file 'xx.txt' containing only 'š', which is 'c5 a1' in UTF-8, '0161' in UTF-16, and code point U+0161 in Unicode. I expected the program to print i = 161 (hex), or at least something close to that.

Here is my code so far:

#include <stdio.h>
#include <stdlib.h>
#include <iomanip>
#include <iostream>
#include <fstream>

using namespace std;

int main ( void ) {
    char name[] = "xx.txt";
    fstream F ( name, ios::in | ios::binary );
    unsigned int i;
    while( F.read ((char *) & i, 2))
    /* I don't know what size to use here. I would guess 2, because I need 2 bytes for the character with UTF-16 code 0161, but 2 doesn't work. */
    cout << "i = " << hex << i << " (hex) ";
    cout << endl;
    F.close();
    system("PAUSE");
    return 0;
}

Thanks in advance

Nikolas Jíša

You don't know how big a character is in UTF-8 until you have finished parsing it. You need to read chars one at a time until you have a complete UTF-8 character.
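As a minimal sketch of that idea: the first byte of a UTF-8 sequence already tells you how many bytes follow, so you can read one byte, inspect its high bits, and then read the rest. The helper name utf8SequenceLength is hypothetical, and this sketch does not reject overlong lead bytes such as 0xC0.

```cpp
#include <cstddef>

// Returns how many bytes the UTF-8 sequence starting with `lead` occupies,
// or 0 if `lead` cannot start a sequence (a continuation byte or an invalid value).
// utf8SequenceLength is a hypothetical helper name, not part of the assignment.
std::size_t utf8SequenceLength(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx or 11111xxx
}
```

For the 'š' in the question, the first byte 0xc5 matches the 110xxxxx pattern, so exactly one more byte (0xa1) belongs to the same character.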

Edit: you don't say what output you are getting, but I suspect it's a byte-ordering issue.
You might be better off reading the input (if you know it is always a 16-bit value) into a char array and then looking at the individual bytes.
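A rough sketch of that suggestion, assuming the input holds at least two bytes: read them into a small array and print each byte on its own, so neither the machine's byte order nor the unused upper bytes of an unsigned int can affect what you see. The function name dumpTwoBytes is made up for this example.

```cpp
#include <iomanip>
#include <istream>
#include <ostream>

// Reads exactly two bytes from `in` and prints them separately, in file order.
// Returns false if fewer than two bytes are available.
// Note: std::hex stays set on `out` after the call.
bool dumpTwoBytes(std::istream& in, std::ostream& out) {
    unsigned char bytes[2];
    if (!in.read(reinterpret_cast<char*>(bytes), 2)) {
        return false;
    }
    out << std::hex
        << "byte 0 = " << static_cast<unsigned int>(bytes[0])
        << ", byte 1 = " << static_cast<unsigned int>(bytes[1]);
    return true;
}
```

With a file containing only c5 a1, this writes "byte 0 = c5, byte 1 = a1". Reading the same two bytes directly into an unsigned int, as the question's code does, places them in memory order: on a little-endian machine they show up as a1c5, and the remaining bytes of i are never written.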

See http://www.joelonsoftware.com/articles/Unicode.html

If your input is in UTF-8, you need to read one byte at a time, not two (you'll want i to have type unsigned char). This gives you a stream of binary data, which you need to decode following the UTF-8 specification. That yields a stream of unsigned ints (Unicode code points), which you then need to re-encode according to the UTF-16 specification.

It depends. If the role of the class is to contain such objects (e.g. a container class), then it's very idiomatic, and the normal way of doing things. In most other cases, however, it is considered preferable to use getter and setter methods. They are not necessarily named getXxx and setXxx: the most frequent naming convention I've seen uses m_attr for the name of the attribute, and simply attr for the name of both the getter and the setter. (Function overloading will choose between them according to the number of arguments.)

-- James Kanze
