
Filtering characters missing from the user's font in Java

I want to build a fairly simple table in Java (as an exercise) to check whether legal, printable Unicode code points exist in the end user's font. Because some fonts cannot print every valid code point, I need to know which printable code points the user's font is missing and therefore cannot print.

For example, if a font supports only Latin characters, I cannot print Greek characters with it, let alone Japanese characters. Unicode says they're all printable, but the user's font may not be good enough.

After a little research I've been able to print most of the characters in Eclipse (by adjusting the encoding). However, I still have a bunch of unknown/unprintable characters in my output: when I look at it, I see empty rectangles in place of some of my printable characters.

I've tried filtering them but I can't find any way to do it. FYI, I'm basically just setting a character's value to 50, 100 or 1000, then incrementing it in a for loop from there to check which characters I can or cannot (or should not?) print.

Can anyone give me some hints on where to start here?

Your task is actually a little more complex than encoding, because the font that you are trying to print with makes a big difference in the output. I.e., not all fonts support the same set of characters. In fact, the support of character ranges differs wildly from font to font.

That said, your problem now becomes: how do I detect whether a certain font supports a given character? And that question has been asked and answered ... See the Java doc of the canDisplay function, which is a member of the Font class.
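As a rough illustration of the canDisplay approach, the sketch below walks a range of code points and collects those a font reports as undisplayable. The class name MissingGlyphs, the helper missing, the choice of the logical "Serif" font and the Greek range are all assumptions made for the example, and it needs Java 8 or later for IntPredicate. The font check is passed in as a predicate so the loop can be tried without a graphics environment.

```java
import java.awt.Font;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical helper class, named here only for illustration.
class MissingGlyphs {
    /** Returns the assigned, non-surrogate code points in [from, to] that the predicate rejects. */
    static List<Integer> missing(IntPredicate canDisplay, int from, int to) {
        List<Integer> result = new ArrayList<>();
        for (int cp = from; cp <= to; cp++) {
            int type = Character.getType(cp);
            // Unassigned code points and surrogates are skipped: no font is expected to have glyphs for them.
            if (type == Character.UNASSIGNED || type == Character.SURROGATE) {
                continue;
            }
            if (!canDisplay.test(cp)) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "Serif" is a logical font; the JVM may back it with several physical fonts.
        Font font = new Font("Serif", Font.PLAIN, 12);
        // Font.canDisplay(int) takes a code point, so supplementary characters work too.
        for (int cp : missing(c -> font.canDisplay(c), 0x0370, 0x03FF)) {
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
        }
    }
}
```

Note that the loop variable is an int code point rather than a char, which matters once the range goes beyond U+FFFF.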

It is unclear what you actually and precisely mean here. If you plan to play by the numbers, then Annex C of Unicode Technical Standard #18 on Unicode Regular Expressions gives the concrete suggestion that a “printable” code point be defined as any code point that has the print property, where that property is defined to be:

  • \p{print} means [[\p{graph}\p{blank}]&&[^\p{gc=Control}]]
  • \p{graph} means [^\p{Whitespace}\p{gc=Control}\p{gc=Surrogate}\p{gc=Unassigned}]
  • \p{blank} means [\p{Whitespace}&&[^\N{LF}\N{VT}\N{FF}\N{CR}\N{NEL}\p{gc=Line_Separator}\p{gc=Paragraph_Separator}]]

Or, as the Java 1.7 Pattern class documents these, provided you compile the pattern with the new-to-Java 7 Pattern.UNICODE_CHARACTER_CLASS flag enabled:

  • \p{Graph} A visible character: [^\p{IsWhite_Space}\p{gc=Cc}\p{gc=Cs}\p{gc=Cn}]
  • \p{Print} A printable character: [\p{Graph}\p{Blank}&&[^\p{Cntrl}]]
  • \p{Blank} A space or a tab: [\p{IsWhite_Space}&&[^\p{gc=Zl}\p{gc=Zp}\x0a\x0b\x0c\x0d\x85]]
  • \p{Cntrl} A control character: \p{gc=Cc}
  • \p{XDigit} A hexadecimal digit: [\p{gc=Nd}\p{IsHex_Digit}]
  • \p{Space} A whitespace character: \p{IsWhite_Space}
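To make the list above concrete, here is a minimal sketch of testing a single code point against \p{Print} with the UNICODE_CHARACTER_CLASS flag. The class name PrintableCheck and the method isPrintable are made up for the example; only Pattern, Matcher and Character from the JDK are used. This answers "does Unicode consider it printable?", not "does the user's font have a glyph for it?".

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class PrintableCheck {
    // With UNICODE_CHARACTER_CLASS, \p{Print} uses the Unicode-aware definition
    // instead of the default US-ASCII-only POSIX class.
    private static final Pattern PRINT =
            Pattern.compile("\\p{Print}", Pattern.UNICODE_CHARACTER_CLASS);

    /** Returns true if the whole one-code-point string matches \p{Print}. */
    static boolean isPrintable(int codePoint) {
        // Character.toChars yields one or two chars, so supplementary code points are handled.
        String s = new String(Character.toChars(codePoint));
        return PRINT.matcher(s).matches();
    }

    public static void main(String[] args) {
        int[] samples = {0x09, 0x20, 0x41, 0xAD, 0x2028, 0x3041, 0x1B000};
        for (int cp : samples) {
            System.out.printf("U+%06X print=%d%n", cp, isPrintable(cp) ? 1 : 0);
        }
    }
}
```

The embedded flag form, (?U)\p{Print}, should behave the same way as passing the flag to Pattern.compile.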

On 'Printable' Characters

If you just use something reasonable like Java's (?U)\p{print} pattern property (or the equivalent from the Character class), then you still have some “interesting” decisions to make.

Consider each of these code points:

U+000007 gc=Cc columns=0 print=0 graph=0  ALERT
U+000008 gc=Cc columns=0 print=0 graph=0  BACKSPACE
U+000009 gc=Cc columns=0 print=0 graph=0  CHARACTER TABULATION
U+00000C gc=Cc columns=0 print=0 graph=0  FORM FEED (FF)
U+00000D gc=Cc columns=0 print=0 graph=0  CARRIAGE RETURN (CR)
U+000020 gc=Zs columns=1 print=1 graph=0  SPACE
U+000021 gc=Po columns=1 print=1 graph=1  EXCLAMATION MARK
U+000041 gc=Lu columns=1 print=1 graph=1  LATIN CAPITAL LETTER A
U+000061 gc=Ll columns=1 print=1 graph=1  LATIN SMALL LETTER A
U+000080 gc=Cc columns=0 print=0 graph=0  PADDING CHARACTER
U+000085 gc=Cc columns=0 print=0 graph=0  NEXT LINE (NEL)
U+00008D gc=Cc columns=0 print=0 graph=0  REVERSE LINE FEED
U+0000AB gc=Pi columns=1 print=1 graph=1  LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
U+0000AD gc=Cf columns=0 print=1 graph=1  SOFT HYPHEN
U+0002B0 gc=Lm columns=1 print=1 graph=1  MODIFIER LETTER SMALL H
U+0002C6 gc=Lm columns=1 print=1 graph=1  MODIFIER LETTER CIRCUMFLEX ACCENT
U+000302 gc=Mn columns=0 print=1 graph=1  COMBINING CIRCUMFLEX ACCENT
U+00036A gc=Mn columns=0 print=1 graph=1  COMBINING LATIN SMALL LETTER H
U+001100 gc=Lo columns=2 print=1 graph=1  HANGUL CHOSEONG KIYEOK
U+002028 gc=Zl columns=0 print=0 graph=0  LINE SEPARATOR
U+002029 gc=Zp columns=0 print=0 graph=0  PARAGRAPH SEPARATOR
U+00202B gc=Cf columns=0 print=1 graph=1  RIGHT-TO-LEFT EMBEDDING
U+00202F gc=Zs columns=1 print=1 graph=0  NARROW NO-BREAK SPACE
U+002060 gc=Cf columns=0 print=1 graph=1  WORD JOINER
U+002061 gc=Cf columns=0 print=1 graph=1  FUNCTION APPLICATION
U+002062 gc=Cf columns=0 print=1 graph=1  INVISIBLE TIMES
U+002064 gc=Cf columns=0 print=1 graph=1  INVISIBLE PLUS
U+002EC1 gc=So columns=2 print=1 graph=1  CJK RADICAL TIGER
U+002F0B gc=So columns=2 print=1 graph=1  KANGXI RADICAL EIGHT
U+003000 gc=Zs columns=2 print=1 graph=0  IDEOGRAPHIC SPACE
U+003008 gc=Ps columns=2 print=1 graph=1  LEFT ANGLE BRACKET
U+00300A gc=Ps columns=2 print=1 graph=1  LEFT DOUBLE ANGLE BRACKET
U+00300C gc=Ps columns=2 print=1 graph=1  LEFT CORNER BRACKET
U+00302B gc=Mn columns=0 print=1 graph=1  IDEOGRAPHIC RISING TONE MARK
U+003030 gc=Pd columns=2 print=1 graph=1  WAVY DASH
U+003037 gc=So columns=2 print=1 graph=1  IDEOGRAPHIC TELEGRAPH LINE FEED SEPARATOR SYMBOL
U+003041 gc=Lo columns=2 print=1 graph=1  HIRAGANA LETTER SMALL A
U+00E000 gc=Co columns=1 print=1 graph=1 <unnamed codepoint in blk=Private_Use_Area>
U+00F8FF gc=Co columns=1 print=1 graph=1 <unnamed codepoint in blk=Private_Use_Area>
U+00FB1E gc=Mn columns=0 print=1 graph=1  HEBREW POINT JUDEO-SPANISH VARIKA
U+00FE00 gc=Mn columns=0 print=1 graph=1  VARIATION SELECTOR-1
U+00FE23 gc=Mn columns=0 print=1 graph=1  COMBINING DOUBLE TILDE RIGHT HALF
U+00FE58 gc=Pd columns=2 print=1 graph=1  SMALL EM DASH
U+00FE77 gc=Lo columns=1 print=1 graph=1  ARABIC FATHA MEDIAL FORM
U+00FEFF gc=Cf columns=0 print=1 graph=1  ZERO WIDTH NO-BREAK SPACE
U+00FF06 gc=Po columns=2 print=1 graph=1  FULLWIDTH AMPERSAND
U+00FFFA gc=Cf columns=0 print=1 graph=1  INTERLINEAR ANNOTATION SEPARATOR
U+00FFFD gc=So columns=1 print=1 graph=1  REPLACEMENT CHARACTER
U+01B000 gc=Lo columns=2 print=1 graph=1  KATAKANA LETTER ARCHAIC E
U+01D165 gc=Mc columns=1 print=1 graph=1  MUSICAL SYMBOL COMBINING STEM
U+01D167 gc=Mn columns=0 print=1 graph=1  MUSICAL SYMBOL COMBINING TREMOLO-1
U+100002 gc=Co columns=1 print=1 graph=1 <unnamed codepoint in blk=Supplementary_Private_Use_Area-B>

Some of them are quite conditional as to what, and perhaps even whether, they print. For example, what does U+F8FF's ‹› look like to you?

Then you have to decide how to handle tabs and backspaces.

Plus you will have to consider the various \p{Grapheme_Extend} code points used to build up a Unicode extended grapheme cluster; that is, a user-visible character. Not all of these are nonspacing marks. In fact, some aren't marks at all, but letters! Some are not printable characters at all, and yet they change the printable \p{Grapheme_Base} character to which they are ineluctably attached; consider just as one example the variation selectors.
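One way to see the difference between code points and user-visible characters from within the JDK is java.text.BreakIterator, whose character instance groups a base character with the marks that follow it. The sketch below is only an approximation: how closely the JDK's boundaries track the extended grapheme cluster definition depends on the JDK version, and the class name Graphemes and method split are assumptions made for the example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named here only for illustration.
class Graphemes {
    /** Splits text at the character boundaries reported by the JDK's BreakIterator. */
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<String> clusters = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            clusters.add(text.substring(start, end));
        }
        return clusters;
    }

    public static void main(String[] args) {
        // 'a', then 'e' followed by COMBINING ACUTE ACCENT (U+0301).
        String text = "ae\u0301";
        System.out.println(text.length() + " chars, " + split(text).size() + " clusters");
    }
}
```

When building a table of what a font can show, checking whole clusters rather than isolated combining marks is probably closer to what the user will actually see.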

Warning

Which brings us to a critically important point that is far too often forgotten by would-be Java programmers and, even when not wholly forgotten, is usually underappreciated.

Always, always, always remember that Java characters are not Unicode characters! There are two reasonable definitions of a Unicode character, and Java gives you neither. Here are the two reasonable definitions:

  • If a character is a character in the programmer-visible sense, then a character is a Unicode code point. This is what . matches in the regex engine, for example, no matter whether you use Sun's or ICU's.
  • If a character is a character in the user-visible sense, then a character is a Unicode extended grapheme cluster. This is what \X matches in the (ICU not Sun) regex engine.

A Java so-called “character” is a low-level, abstraction-breaking 2-byte element of a variable-width UTF-16 representation of an actual Unicode code point. It is neither an abstract code point nor an abstract grapheme. It is not an abstract anything. A Java char is a violation of the envelope of abstraction.

Yes, some Java classes give you a codePointAt interface, and you should absolutely use those wherever they are available. But in ways that would take too long to explain here, Java is fundamentally broken in its character abstraction, because it doesn't have one.
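For what it's worth, here is a small sketch of walking a String by code point with codePointAt instead of by char. The class name CodePoints and the method of are invented for the example; the sample string uses U+1B000 from the table above, which takes two chars in UTF-16.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named here only for illustration.
class CodePoints {
    /** Returns the code points of s in order, stepping over surrogate pairs as single units. */
    static List<Integer> of(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(cp);
            // Advance by 1 char for BMP code points and by 2 for supplementary ones.
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        // KATAKANA LETTER ARCHAIC E (U+1B000) followed by LATIN CAPITAL LETTER A.
        String s = new String(Character.toChars(0x1B000)) + "A";
        System.out.println("chars: " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        for (int cp : of(s)) {
            System.out.printf("U+%06X %s%n", cp, Character.getName(cp));
        }
    }
}
```

A plain `for (char c : s.toCharArray())` loop over the same string would visit the two surrogate halves separately, and neither half is a character in either of the senses given above.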

This makes working with Unicode characters and strings at best error-prone in Java, and often next to impossible.

Good luck.
