简体   繁体   English

UTF-8宽度显示汉字问题

[英]UTF-8 Width Display Issue of Chinese Characters

When I use Perl or C to printf some data, I tried their format to control the width of each column, like 当我使用Perl或C来printf一些数据时,我尝试使用它们的格式来控制每列的​​宽度,比如

printf("%-30s", str);

But when str contains Chinese character, then the column doesn't align as expected. 但是当str包含中文字符时,该列不会按预期对齐。 see the attachment picture. 看附件图片。

My ubuntu's charset encoding is zh_CN.utf8, as far as I know, utf-8 encoding has 1~4 length of bytes. 我的ubuntu的charset编码是zh_CN.utf8,据我所知,utf-8编码有1~4个字节长度。 Chinese character has 3 bytes. 汉字有3个字节。 In my test, I found printf's format control count a Chinese character as 3, but it actually displays as 2 ascii width. 在我的测试中,我发现printf的格式控件将中文字符计为3,但它实际上显示为2 ascii宽度。

So the real display width is not a constant as expected but a variable related to the number of Chinese character, ie 因此,实际显示宽度不是预期的常数,而是与汉字数量相关的变量,即

Sw(x) = 1 * (w - 3x) + 2 * x = w - x

w is the width limit expected, x is the count of Chinese characters, Sw(x) is the real display width. w是预期的宽度限制,x是中文字符的数量,Sw(x)是实际显示宽度。

So the more Chinese character str contains, the shorter it displays. 因此,中文字符str包含的越多,它显示的越短。

How can I get what I want? 我怎样才能得到我想要的东西? Count the Chinese characters before printf? 在printf之前计算汉字?

As far as I know, all Chinese or even all wide characters I guess, displays as 2 width, then why printf count it as 3? 据我所知,我猜所有的中文甚至所有宽字都显示为2宽,那么为什么printf算为3呢? UTF-8's encoding has nothing to do with display length. UTF-8的编码与显示长度无关。

Yes, this is a problem with all versions of printf that I am aware of. 是的,这是我所知道的所有版本的printf的问题。 I briefly discuss the matter in this answer and also in this one . 我简要地讨论此事这个答案也是在这一个

For C, I do not know of a library that will do this for you, but if anyone has it, it would be ICU. 对于C,我不知道哪个库会为你做这个,但如果有人有它,它将是ICU。

For Perl, you have to use the Unicode::GCString module form CPAN to calculate the number of print columns a Unicode string will take up. 对于Perl,您必须使用CPAN的Unicode :: GCString模块来计算Unicode字符串将占用的打印列数。 This takes into account Unicode Standard Annex #11: East Asian Width . 这考虑了Unicode标准附件#11:东亚宽度

For example, some code points take up 1 column and others take up 2 columns. 例如,某些代码点占用1列而其他代码占用2列。 There are even some that take up no columns at all, like combining characters and invisible control characters. 甚至有一些根本不占用任何列,例如组合字符和不可见的控制字符。 The class has a columns method that returns how many columns the string takes up. 该类有一个columns方法,它返回字符串占用的列数。

I have an example of using this for aligning Unicode text vertically here . 我已经使用这个为垂直对齐Unicode文本的例子在这里 It will sort a bunch of Unicode strings, including some with combining characters and “wide” Asian ideograms (CJK characters), and allow you to align things vertically. 它将对一堆Unicode字符串进行排序,包括一些组合字符和“宽”亚洲表意文字(CJK字符),并允许您垂直对齐。

样本终端输出

Code for the little umenu demo program which prints that nicely aligned output, is included below. 下面列出了打印出良好对齐输出的小umenu演示程序的代码。

You might also be interested the far more ambitious Unicode::LineBreak module, of which the aforementioned Unicode::GCString class is just a smaller component. 您可能还会对更加雄心勃勃的Unicode :: LineBreak模块感兴趣,其中前面提到的Unicode::GCString类只是一个较小的组件。 This module is much cooler, and takes into account Unicode Standard Annex #14: Unicode Line Breaking Algorithm . 该模块更酷,并考虑到Unicode标准附件#14:Unicode断行算法

Here's the code for the little umenu demo, tested on Perl v5.14: 这是在umenu测试的小umenu演示的代码:

 #!/usr/bin/env perl
 # umenu - demo sorting and printing of Unicode food
 #
 # (obligatory and increasingly long preamble)
 #
 use utf8;
 use v5.14;                       # for locale sorting
 use strict;
 use warnings;
 use warnings  qw(FATAL utf8);    # fatalize encoding faults
 use open      qw(:std :utf8);    # undeclared streams in UTF-8
 use charnames qw(:full :short);  # unneeded in v5.16

 # std modules
 use Unicode::Normalize;          # std perl distro as of v5.8
 use List::Util qw(max);          # std perl distro as of v5.10
 use Unicode::Collate::Locale;    # std perl distro as of v5.14

 # cpan modules
 use Unicode::GCString;           # from CPAN

 # forward defs
 sub pad($$$);
 sub colwidth(_);
 sub entitle(_);

 my %price = (
     "γύρος"             => 6.50, # gyros, Greek
     "pears"             => 2.00, # like um, pears
     "linguiça"          => 7.00, # spicy sausage, Portuguese
     "xoriço"            => 3.00, # chorizo sausage, Catalan
     "hamburger"         => 6.00, # burgermeister meisterburger
     "éclair"            => 1.60, # dessert, French
     "smørbrød"          => 5.75, # sandwiches, Norwegian
     "spätzle"           => 5.50, # Bayerisch noodles, little sparrows
     "包子"              => 7.50, # bao1 zi5, steamed pork buns, Mandarin
     "jamón serrano"     => 4.45, # country ham, Spanish
     "pêches"            => 2.25, # peaches, French
     "シュークリーム"    => 1.85, # cream-filled pastry like éclair, Japanese
     "막걸리"            => 4.00, # makgeolli, Korean rice wine
     "寿司"              => 9.99, # sushi, Japanese
     "おもち"            => 2.65, # omochi, rice cakes, Japanese
     "crème brûlée"      => 2.00, # tasty broiled cream, French
     "fideuà"            => 4.20, # more noodles, Valencian (Catalan=fideuada)
     "pâté"              => 4.15, # gooseliver paste, French
     "お好み焼き"        => 8.00, # okonomiyaki, Japanese
 );

 my $width = 5 + max map { colwidth } keys %price;

 # So the Asian stuff comes out in an order that someone
 # who reads those scripts won't freak out over; the
 # CJK stuff will be in JIS X 0208 order that way.
 my $coll  = new Unicode::Collate::Locale locale => "ja";

 for my $item ($coll->sort(keys %price)) {
     print pad(entitle($item), $width, ".");
     printf " €%.2f\n", $price{$item};
 }

 sub pad($$$) {
     my($str, $width, $padchar) = @_;
     return $str . ($padchar x ($width - colwidth($str)));
 }

 sub colwidth(_) {
     my($str) = @_;
     return Unicode::GCString->new($str)->columns;
 }

 sub entitle(_) {
     my($str) = @_;
     $str =~ s{ (?=\pL)(\S)     (\S*) }
              { ucfirst($1) . lc($2)  }xge;
     return $str;
 }

As you see, the key to making it work in that particular program is this line of code, which just calls other functions defined above, and uses the module I was discussing: 如您所见,使其在该特定程序中工作的关键是这行代码,它只调用上面定义的其他函数,并使用我正在讨论的模块:

print pad(entitle($item), $width, ".");

That will pad out the item to the given width using dots as the fill character. 这将使用点作为填充字符将项目填充到给定宽度。

Yes, it's a lot less convenient that printf , but at least it is possible. 是的, printf不太方便,但至少有可能。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM