
How to import non-ASCII characters into console?

I've been scratching my head over this for a while and need some assistance. Basically, I want the code to read a series of non-ASCII symbols into an empty, pre-sized array. I'm printing them to check whether they were read in, and currently they are not. Notepad displays them just fine, but for some reason C++ doesn't recognise them as valid characters. Advice that is only about code, and not about changing my computer's internal settings, is strongly preferred.

char displayCharacters[5] = {};

try {

    instream.open("characters.txt");
    instream >> displayCharacters;
    cout << "Here is the first symbol: " << displayCharacters[4];

} 

catch (exception) {

    cout << "Something went wrong with the file handling.";

}

And yes, I have set up the input stream correctly, and cout comes from including iostream with using namespace std. Here's what the file contains:

█
 
▀
▄
▓

Edit: The file is UTF-8, if you need to know.

tl;dr

You need to decode UTF-8 before you can index it. Read on for more details than I was expecting to write…


A C++ stream isn't encoding-aware: it's just a stream of bytes. For example, this code to dump an entire UTF-8 string works just fine:

#include <iostream>
#include <sstream>
#include <string>

int main() {
    // Simulate your `instream` using an `std::stringstream`
    std::stringstream instream;
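    // Load the simulated `instream` using a UTF-8 string literal [1].
    // Note: this compiles before C++20 only; since C++20 a u8 literal is an
    // array of char8_t, which cannot be inserted into a char stream.
    instream << u8"█\n \n▀\n▄\n▓\n";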
    
    // Print entire `instream`
    std::cout << instream.rdbuf();
}

[1]: https://en.cppreference.com/w/cpp/language/string_literal

Your problem comes from the UTF-8 encoding itself. UTF-8 is a multibyte encoding. Some characters (notably the ASCII characters) are encoded as a single byte. For instance, the letter a is encoded as the value 97 (0x61 in hex).

Let's take a look at the five characters you're trying to print:

Char     Unicode codepoint  UTF-8 encoding  Unicode name
█        U+2588             0xe2 0x96 0x88  FULL BLOCK
(space)  U+0020             0x20            SPACE (this one's just plain ASCII)
▀        U+2580             0xe2 0x96 0x80  UPPER HALF BLOCK
▄        U+2584             0xe2 0x96 0x84  LOWER HALF BLOCK
▓        U+2593             0xe2 0x96 0x93  DARK SHADE

The UTF-8 encoding is the interesting part here: that's how each of these characters is stored as a sequence of bytes in a UTF-8 encoded file. For the four block-drawing characters (we'll ignore the space because that's just a single-byte character), the encoding takes three bytes.

But why does the encoding take three bytes if the codepoint is only two bytes long?

Good question. Let's break down the first character:

   0xe2     0x96     0x88
 11100010 10010110 10001000
 AAAA^^^^ BB^^^^^^ BB^^^^^^

The annotations underneath the binary indicate how the encoding works.

Since the codepoint for the character is too big to fit into a single byte, UTF-8 breaks it into multiple bytes. However, there must be a way to determine that a sequence of bytes represents a single character, not just a sequence of simpler characters. This is where the byte prefixes (A and B) come into play. The first byte in the multibyte sequence begins with a run of 1 bits, one for each byte in the encoded character, followed by a terminating 0. Here we need three bytes, so we have 1110 (A).

The prefixes of the remaining two bytes indicate that they are continuation bytes (i.e. they should not be considered the beginning of a character). The prefix for continuation bytes is defined as 10 (B).

After removing these prefixes, the remaining bits (marked with carets [^]) are packed together and parsed to retrieve the encoded codepoint. That leaves 4 + 6 + 6 = 16 payload bits, exactly enough for a two-byte codepoint, which is why the encoding needs three bytes.

Single-byte characters (i.e. the basic US-ASCII range, 0 to 127) only require 7 bits to encode, so a 0 bit is prefixed to indicate there are no continuation bytes for this character.
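To make the bit-packing concrete, here is a minimal sketch of that decoding step for the three-byte case only. The name decode3 is a hypothetical helper of mine, and it assumes the three bytes are already known to be a well-formed 1110xxxx 10xxxxxx 10xxxxxx sequence (no validation is done):

```cpp
#include <cstdint>

// Hypothetical helper (an illustration, not a full decoder): turn one
// three-byte UTF-8 sequence into its Unicode codepoint.
// Assumes well-formed input of the shape 1110xxxx 10xxxxxx 10xxxxxx.
constexpr std::uint32_t decode3(unsigned char b0, unsigned char b1, unsigned char b2) {
    return (static_cast<std::uint32_t>(b0 & 0x0F) << 12)  // drop prefix 1110, keep 4 bits
         | (static_cast<std::uint32_t>(b1 & 0x3F) << 6)   // drop prefix 10, keep 6 bits
         |  static_cast<std::uint32_t>(b2 & 0x3F);        // drop prefix 10, keep 6 bits
}

// The first character from the table: e2 96 88 is U+2588 (FULL BLOCK).
static_assert(decode3(0xE2, 0x96, 0x88) == 0x2588, "FULL BLOCK");
```

The masks 0x0F and 0x3F correspond to the caret-marked bits in the diagram above; the shifts put the 4 + 6 + 6 payload bits back into one 16-bit value.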

What does all this have to do with your problem?

I said earlier that "your problem comes from the UTF-8 encoding itself". Well, I lied. Sorry. Your problem comes from attempting to read UTF-8 encoded data as a plain sequence of bytes.

With the encoding table above, let's take a look at the raw bytes in your file (assuming a single \n terminating each line):

e2 96 88 0a 20 0a e2 96 80 0a e2 96 84 0a e2 96 93 0a
\--01--/    02    \--03--/    \--04--/    \--05--/

I've marked the characters by their line numbers.

From this dump, you can easily see what the output of your problematic code will be:

char displayCharacters[5] = {};
std::cout << "Here is the first symbol: " << displayCharacters[4];

It's a space, or nothing visible at all. Remember, the stream isn't aware of the file's encoding, so it just hands you a sequence of bytes (a char in C/C++ is just a single byte). Byte 4 (zero-indexed) of the dump above is 0x20, so if your array (displayCharacters) held those bytes, subscripting it with [4] would return a plain space. In practice, instream >> displayCharacters stops at the first whitespace, so only e2 96 88 and a terminating null are stored, and displayCharacters[4] is still the zero the array was initialised with. Either way, index 4 is not the start of one of your block characters.

You actually got lucky here. Indexing UTF-8 data as raw bytes often causes much uglier errors. Remember those continuation bytes (beginning 10)? If you extract one of those and try to print it on its own, your terminal will have no idea what to do with it. The same goes for the first byte of a multibyte sequence (prefix 11).

Properly indexing UTF-8 strings is hard. You'll almost certainly want a library to handle it.
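If a library really isn't an option, a rough sketch of the idea is to use the lead-byte prefix to find out how many bytes belong together, and keep each group as its own std::string. The names sequenceLength and splitUtf8 are hypothetical, the code assumes mostly well-formed input (invalid bytes are passed through one at a time), and "character" here means one codepoint, not what a user would necessarily see as one glyph:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// Bytes that cannot start a sequence (e.g. a stray continuation byte)
// are treated as length 1 so the loop below always makes progress.
std::size_t sequenceLength(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                             // not a valid lead byte
}

// Split UTF-8 text into one std::string per encoded codepoint.
std::vector<std::string> splitUtf8(const std::string& text) {
    std::vector<std::string> chars;
    for (std::size_t i = 0; i < text.size();) {
        std::size_t len = sequenceLength(static_cast<unsigned char>(text[i]));
        if (len > text.size() - i) len = text.size() - i;  // truncated tail
        chars.push_back(text.substr(i, len));
        i += len;
    }
    return chars;
}
```

With the whole file in a std::string, splitUtf8(contents)[0] would then hold all three bytes of █ and can be sent to std::cout as a unit. Note that the newlines and the space count as characters too.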

Depending on the use and/or origin of the file in question, you might want to consider a fixed-width encoding such as UTF-32.
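If the file has to stay UTF-8 on disk, one possible middle ground is to decode it into UTF-32 in memory, where every element is a whole codepoint and plain subscripting works. The sketch below uses a made-up helper named toUtf32; it assumes well-formed input and does no validation, so malformed bytes will produce meaningless codepoints:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode well-formed UTF-8 into UTF-32.
// No validation is performed (overlong forms, surrogates and stray
// continuation bytes are not detected).
std::u32string toUtf32(const std::string& utf8) {
    std::u32string out;
    for (std::size_t i = 0; i < utf8.size();) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t extra = 0;  // continuation bytes still to read
        char32_t cp = lead;     // ASCII: the byte is the codepoint
        if      ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        ++i;
        // Append the low 6 bits of each continuation byte.
        for (; extra > 0 && i < utf8.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

After this, toUtf32(contents)[0] is the single value 0x2588 for █. Printing it still means encoding it back to whatever your console expects, so this helps with indexing, not with output.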
