Formatting columns containing non-ascii characters

So I want to align fields containing non-ascii characters. The following does not seem to work:

for word1, word2 in [['hello', 'world'], ['こんにちは', '世界']]:
    print "{:<20} {:<20}".format(word1, word2)

hello                world
こんにちは      世界

Is there a solution?

You are formatting a multi-byte encoded string. You appear to be using UTF-8 to encode your text, and that encoding uses multiple bytes per codepoint (between 1 and 4, depending on the specific character). Formatting a byte string counts bytes, not codepoints, which is one reason why your strings end up misaligned:

>>> len('hello')
5
>>> len('こんにちは')
15
>>> len(u'こんにちは')
5
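For comparison, here is a minimal sketch assuming Python 3 rather than the Python 2 used in the question. There, every str is already a Unicode string, so len() counts codepoints, and the byte count only shows up after an explicit encode. The variable names are illustrative only.

```python
# Python 3 assumed: str holds Unicode text, bytes holds encoded data.
text = "こんにちは"
encoded = text.encode("utf-8")

codepoints = len(text)      # counts codepoints
byte_count = len(encoded)   # counts UTF-8 bytes

print(codepoints, byte_count)
```

The two numbers match the u'...' and plain '...' results in the Python 2 session above.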

Format your text as Unicode strings instead, so that you can count codepoints, not bytes:

for word1, word2 in [[u'hello', u'world'], [u'こんにちは', u'世界']]:
    print u"{:<20} {:<20}".format(word1, word2)

Your next problem is that these characters are also wider than most; you have double-wide codepoints:

>>> import unicodedata
>>> unicodedata.east_asian_width(u'h')
'Na'
>>> unicodedata.east_asian_width(u'世')
'W'
>>> for word1, word2 in [[u'hello', u'world'], [u'こんにちは', u'世界']]:
...     print u"{:<20} {:<20}".format(word1, word2)
...
hello                world
こんにちは                世界

str.format() is not equipped to deal with that issue; you'll have to manually adjust your column widths before formatting, based on how many characters are registered as wide in the Unicode standard.

This is tricky because there is more than one width available. See the East Asian Width Unicode standard annex; there are narrow, wide and ambiguous widths; narrow is the width most other characters print at, wide is double that on my terminal. Ambiguous is... ambiguous as to how wide it'll actually be displayed:

Ambiguous characters require additional information not contained in the character code to further resolve their width.

How they are displayed depends on the context; Greek characters, for example, are displayed as narrow characters in a Western text, but as wide in an East Asian context. My terminal displays them as narrow, but other terminals (configured for an East Asian locale, for example) may display them as wide instead. I'm not sure if there are any fool-proof ways of figuring out how that would work.
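To see which class a given character falls into, a small sketch along these lines may help. It assumes Python 3; width_class and has_ambiguous are hypothetical helper names, not part of unicodedata.

```python
import unicodedata

def width_class(ch):
    """Return the East Asian Width class of a single character."""
    return unicodedata.east_asian_width(ch)

def has_ambiguous(text):
    """True if any character's width depends on context (class 'A')."""
    return any(width_class(ch) == "A" for ch in text)

# 'h' (Latin), '世' (CJK), 'α' (Greek), '\uff48' (fullwidth Latin h)
for ch in ["h", "世", "α", "\uff48"]:
    print(repr(ch), width_class(ch))

print(has_ambiguous("hello"), has_ambiguous("αβγ"))
```

A check like has_ambiguous only tells you that the result may vary between terminals; it cannot tell you which width a particular terminal will pick.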

For the most part, you need to count characters with a 'W' or 'F' value for unicodedata.east_asian_width() as taking 2 positions; subtract 1 from your format width for each of these:

def calc_width(target, text):
    return target - sum(unicodedata.east_asian_width(c) in 'WF' for c in text)

for word1, word2 in [[u'hello', u'world'], [u'こんにちは', u'世界']]:
    print u"{0:<{1}} {2:<{3}}".format(word1, calc_width(20, word1), word2, calc_width(20,  word2))

This then produces the desired alignment in my terminal:

>>> for word1, word2 in [[u'hello', u'world'], [u'こんにちは', u'世界']]:
...     print u"{0:<{1}} {2:<{3}}".format(word1, calc_width(20, word1), word2, calc_width(20,  word2))
...
hello                world
こんにちは           世界

The slight misalignment you may see above is your browser or font using a different width ratio (not quite double) for the wide codepoints.
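The same idea can be sketched for Python 3 as a padding helper. This is an illustration under stated assumptions, not a complete solution: display_width and pad_to_width are hypothetical names, the ambiguous_wide switch is an assumption about how you might want to treat class 'A' characters, and zero-width characters such as combining marks are not handled.

```python
import unicodedata

def display_width(text, ambiguous_wide=False):
    """Estimate terminal columns: 2 for wide/fullwidth characters, else 1."""
    wide_classes = {"W", "F"}
    if ambiguous_wide:
        # Assumption: the terminal renders ambiguous characters as wide.
        wide_classes.add("A")
    return sum(
        2 if unicodedata.east_asian_width(ch) in wide_classes else 1
        for ch in text
    )

def pad_to_width(text, target, ambiguous_wide=False):
    """Left-align text in `target` columns by appending spaces."""
    padding = max(target - display_width(text, ambiguous_wide), 0)
    return text + " " * padding

for word1, word2 in [["hello", "world"], ["こんにちは", "世界"]]:
    print(pad_to_width(word1, 20) + " " + pad_to_width(word2, 20))
```

Padding the string directly avoids passing a computed width into the format spec, at the cost of one extra function call per field.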

All this comes with a caveat: not all terminals support the East Asian Width Unicode property; some display all codepoints at one width only.

This is no easy task. This is not simply "non-ascii": these are wide Unicode characters, and displaying them is quite tricky; it fundamentally depends more on the terminal type you are using than on the number of spaces you put in there.

To start with, you have to use Unicode strings. Since you are in Python 2, this means you should prefix your text quotes with "u".

for word1, word2 in [[u'hello', u'world'], [u'こんにちは', u'世界']]:
    print u"{:<20} {:<20}".format(word1, word2)

That way, Python can actually recognize each character inside the strings as a character, not as a collection of bytes that just happen to be displayed back correctly.

>>> a = u'こんにちは'
>>> len(a)
5
>>> b = 'こんにちは'
>>> len(b)
15

At first glance it looks like these lengths could be used to calculate the character width. Unfortunately, the byte length of the UTF-8 encoded characters is not related to the actual displayed width of the characters. Single-width Unicode characters are also multi-byte in UTF-8 (like ç).
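A quick way to convince yourself, sketched here for Python 3 (utf8_len is a hypothetical helper name): the byte count grows from 1 to 2 to 3 across these three characters, yet only the last one is double width.

```python
# Python 3 assumed.
# 'h', precomposed 'ç' (U+00E7), and a wide CJK character
samples = ["h", "\u00e7", "世"]

def utf8_len(ch):
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

byte_lengths = [utf8_len(ch) for ch in samples]
print(byte_lengths)
```

Since ç takes two bytes but one column, and 世 takes three bytes but two columns, no fixed ratio converts byte length into display width.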

Now, once we are talking about Unicode, Python does include some utilities, including a function that reports the display width class of each Unicode character: unicodedata.east_asian_width. This gives you a way to compute the display width of each string and then work out the proper amount of spacing:

The auto-calculation of the "{:<20}" padding counts characters rather than display columns, so the padding is computed by hand here:

import unicodedata

def display_len(text):
    # Wide ('W') and fullwidth ('F') characters take two columns, everything else one.
    res = 0
    for char in text:
        res += 2 if unicodedata.east_asian_width(char) in ('W', 'F') else 1
    return res

for word1, word2 in [[u'hello', u'world'], [u'こんにちは', u'世界']]:
    width_format = u"{{}}{}{{}}".format(" " * (20 - (display_len(word1))))
    print width_format.format(word1, word2)

That has worked for me on my terminal:

hello              world
こんにちは          世界

But as Martijn puts it, it is more complicated than that. There are ambiguous characters and differing terminal types. If you really need this text to be aligned in a text terminal, then you should use a terminal library, like curses, which allows you to specify a display coordinate to print a string at. That way, you can simply position your cursor explicitly on the appropriate column before printing each word, and avoid all display-width computation.
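A rough sketch of that curses approach, assuming Python 3 on a platform where the standard curses module is available and a terminal with wide-character support. The names draw_rows, ROWS and SECOND_COLUMN are illustrative; only addstr, clear, refresh, getkey and curses.wrapper come from the curses module.

```python
import curses
import locale

ROWS = [("hello", "world"), ("こんにちは", "世界")]
SECOND_COLUMN = 20  # assumed screen column for the second field

def draw_rows(screen, rows, second_column=SECOND_COLUMN):
    """Write each pair at explicit (row, column) coordinates."""
    for y, (word1, word2) in enumerate(rows):
        screen.addstr(y, 0, word1)
        screen.addstr(y, second_column, word2)

def main(stdscr):
    stdscr.clear()
    draw_rows(stdscr, ROWS)
    stdscr.refresh()
    stdscr.getkey()  # wait for a key press before restoring the terminal

# To try it in a real terminal, uncomment the two lines below.
# Setting the locale first lets curses handle non-ASCII text.
# locale.setlocale(locale.LC_ALL, "")
# curses.wrapper(main)
```

Because the second word is placed by coordinate, its column does not depend on how wide the first word was rendered.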

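Notice: technical posts on this site are published under the CC BY-SA 4.0 license. If you republish, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.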