
std::string and UTF-8 encoded Unicode

Original post 2013-09-07 09:27:37 4 3 c++/ string/ unicode/ utf-8

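If I understand correctly, both string and wstring can be used to store UTF-8 text.

With char, ASCII characters take one byte, while some Chinese characters take 3 or 4 bytes, and so on. This means that str[3] does not necessarily refer to the 4th character.
With wchar_t it is the same, except that the minimum number of bytes per character is always 2 (instead of 1 for char), and a character that is 3 or 4 bytes wide takes 2 wchar_t.

Right?

So what if I want to use string::find_first_of(), string::compare(), or other member functions on a string encoded in such an odd way? Will it work? Does the string class handle the fact that characters have a variable size? Or should I use these classes only as dumb, feature-less byte arrays, in which case I would rather use a wchar_t[] buffer?

If std::string does not handle this, a second question: are there libraries that provide a string class that handles UTF-8, so that str[3] actually refers to the 4th character (which would itself be a byte array of length 1 to 4)?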

3 answers

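You are talking about Unicode. Unicode assigns every character a code point, and storing a code point directly takes 32 bits (UTF-32). Because that wastes memory for most text, more compact encodings exist. UTF-8 is one of them: its code unit is the byte, and it maps each Unicode character to 1, 2, 3, or 4 bytes. UTF-16 is another: its code unit is a 16-bit word, and it maps each Unicode character to 1 or 2 words (2 or 4 bytes). You can use either encoding with either string or wchar_t. UTF-8 tends to be more compact for English text and numbers.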
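To make the 1-to-4-byte mapping concrete, here is a minimal sketch of a UTF-8 encoder. The name encode_utf8 is a hypothetical helper for illustration, not part of the standard library or of any library mentioned in this thread.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Returns an empty string for values that are not encodable
// (surrogates U+D800..U+DFFF, or anything above U+10FFFF).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx (plain ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogate
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, encode_utf8(0x61) yields one byte, encode_utf8(0xE9) two, encode_utf8(0x4E2D) three, and encode_utf8(0x1F600) four, so size() on the resulting std::string reports bytes rather than characters.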

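Some things work regardless of the encoding and type you use (compare, for example). However, every function that needs to understand what a character is will break: the 5th character is not always the 5th element of the underlying array. It may appear to work with some examples, but it will eventually break. string::compare works, but do not expect alphabetical ordering from it; that depends on the language. string::find_first_of works for some inputs but not all. Long strings will likely work simply because they are long, while shorter ones can be confused by character alignment and produce bugs that are very hard to find.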
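A small sketch of how these byte-level semantics show up in practice. The names count_code_points, cafe, and n_tilde are made up for this illustration, and the input is assumed to be valid UTF-8, written with explicit byte escapes so the result does not depend on the source file's encoding.

```cpp
#include <cstddef>
#include <string>

// Count code points in a string assumed to hold valid UTF-8, by
// skipping continuation bytes (those of the form 10xxxxxx).
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// "café": the é (U+00E9) is the two bytes C3 A9.
const std::string cafe = "caf\xC3\xA9";
// "ñ" (U+00F1) is the two bytes C3 B1; it shares its first byte with é.
const std::string n_tilde = "\xC3\xB1";
```

Here cafe.size() is 5 while count_code_points(cafe) is 4. cafe.find_first_of(n_tilde) returns 3 even though the text contains no ñ, because find_first_of matches the single byte 0xC3. A whole-sequence search such as cafe.find(n_tilde) correctly returns std::string::npos, and find_first_of stays reliable when the search set contains only ASCII characters.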

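The best thing to do is find a library that handles this for you and ignore the underlying type (unless you have a strong reason to pick one or the other).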

You cannot handle Unicode with std::string or any other tool from the standard library. Use an external library such as http://utfcpp.sourceforge.net/
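To illustrate the kind of code-point indexing the question asks for, here is a hand-rolled sketch. It is not the API of the library linked above; code_point_at is a hypothetical name, and the input is assumed to be valid UTF-8. A real library would also validate the input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: return the bytes of the n-th code point (0-based)
// of a string assumed to hold valid UTF-8, or an empty string if n is
// out of range.
std::string code_point_at(const std::string& s, std::size_t n) {
    std::size_t i = 0;
    while (i < s.size()) {
        // Find the end of the code point that starts at byte i by
        // skipping its continuation bytes (10xxxxxx).
        std::size_t j = i + 1;
        while (j < s.size() &&
               (static_cast<unsigned char>(s[j]) & 0xC0) == 0x80) {
            ++j;
        }
        if (n == 0) return s.substr(i, j - i);
        --n;
        i = j;
    }
    return std::string();
}
```

Unlike str[3], this lookup has to scan from the start of the string, so it costs time proportional to the index. That cost is one reason UTF-8 libraries tend to offer iteration over code points rather than random access.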

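You are right about these:
...this means that str[3] does not necessarily refer to the 4th character... use them only as dumb, feature-less byte arrays...

A C++ string can only handle ASCII characters. This differs from a Java String, which can handle Unicode characters. You can store the encoded form (the bytes) of Chinese characters in a string (a char in C/C++ is just a byte), but that is of little use, because string treats each byte as an ASCII character, so you cannot use the string functions to process the text.
wstring may be what you need.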

One thing should be clarified: UTF-8 is just an encoding of Unicode characters (a way of converting characters to and from bytes).
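As a rough illustration of the bytes-to-characters direction, the sketch below decodes UTF-8 bytes into code points stored in a std::u32string. The name decode_utf8 is hypothetical, and the function assumes valid input: it does not reject malformed, overlong, or truncated sequences.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode a string assumed to hold valid UTF-8 into
// a sequence of code points. No validation is performed.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte tells how many bytes the sequence occupies.
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < bytes.size(); ++k) {
            cp = (cp << 6) |
                 (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

After decoding, each element of the result is one code point, so indexing gives the n-th code point directly, at the cost of 4 bytes per element.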

Related questions

Storing unicode UTF-8 string in std::string
Handling UTF-8 encoded strings between std::wstring and std::string
Getting a boost::filesystem::path as an UTF-8 encoded std::string, on Windows
Getting the actual length of a UTF-8 encoded std::string?
How to convert UTF-8 encoded std::string to UTF-16 std::string
std::string is natively encoded in UTF-8 but char can not hold utf characters?
Why unicode char is stored as UTF-8 in std::string and UTF-16/32 in wchar_t?
How to read java unicode byte string encoded with utf-8 in c++
utf-8 encoding a std::string?
C++: How do I write the contents of std::string to a UTF-8 encoded file?


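Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you republish, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.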

