
Analyzing full width or half width character in Java

I would like to tell full-width and half-width characters apart in a char array.

For example:

char [] password = {'t','e','s','t','思','題'};

There are full width and half width characters in this char array.

half width = t,e,s,t

full width = 思,題

So, how can I determine in Java whether each char in the array is full width or half width?

Thanks a lot!

The JDK contains one class that mentions full/half width: InputSubset

http://docs.oracle.com/javase/7/docs/api/java/awt/im/InputSubset.html

Unfortunately there's no method to check which char falls in which subset.

Nonetheless, full/half width is apparently a well-defined concept in Unicode. There may be an accurate spec somewhere on the internet.

http://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms

http://en.wikipedia.org/wiki/DBCS

I guess it'll be good enough for your use case to say that chars in 0x00-0xFF are half-width and all other chars are full-width, except for the half-width chars in the Unicode block "Halfwidth and Fullwidth Forms":

boolean isHalfWidth(char c)
{
    return '\u0000' <= c && c <= '\u00FF'
        || '\uFF61' <= c && c <= '\uFFDC'
        || '\uFFE8' <= c && c <= '\uFFEE' ;
}
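As a usage sketch, the same range check can be applied to the array from the question to split it into the two groups. The class name `WidthSplit` and the helper `select` are made up for illustration, and the two CJK characters 思 and 題 are written as `\u601D` and `\u984C` so the source compiles under any file encoding. Everything outside the half-width ranges is treated as full-width, which is this answer's heuristic rather than a rule defined by Unicode.

```java
class WidthSplit {
    // Same range heuristic as above: Latin-1 plus the half-width part of
    // the "Halfwidth and Fullwidth Forms" block.
    static boolean isHalfWidth(char c) {
        return ('\u0000' <= c && c <= '\u00FF')
            || ('\uFF61' <= c && c <= '\uFFDC')
            || ('\uFFE8' <= c && c <= '\uFFEE');
    }

    // Collects, in order, the chars whose isHalfWidth result equals wantHalf.
    static String select(char[] chars, boolean wantHalf) {
        StringBuilder sb = new StringBuilder();
        for (char c : chars) {
            if (isHalfWidth(c) == wantHalf) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The array from the question: t, e, s, t, U+601D, U+984C.
        char[] password = {'t', 'e', 's', 't', '\u601D', '\u984C'};
        System.out.println("half width = " + select(password, true));
        System.out.println("full width = " + select(password, false));
    }
}
```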

The visible width of a character really depends on the font that you view it in, and the characters in Java are abstract with respect to fonts.

If you're looking to determine whether a particular character is a CJK (or other language subset) character, you might try finding the range of values those characters occupy in UTF-16 (which is what Java uses for char) and checking that each char value falls within that range.
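A minimal sketch of that idea, using the JDK's own script tables instead of hand-written ranges. `Character.UnicodeScript` has been part of the standard library since Java 7; the class name `CjkScriptCheck` and the choice of which scripts count as "CJK" are assumptions made for this example.

```java
class CjkScriptCheck {
    // Returns true if c is assigned to one of the scripts listed below.
    // Which scripts count as "CJK" is an assumption; adjust as needed.
    static boolean isCjk(char c) {
        Character.UnicodeScript script = Character.UnicodeScript.of(c);
        return script == Character.UnicodeScript.HAN
            || script == Character.UnicodeScript.HIRAGANA
            || script == Character.UnicodeScript.KATAKANA
            || script == Character.UnicodeScript.HANGUL;
    }

    public static void main(String[] args) {
        // The array from the question: t, e, s, t, U+601D, U+984C.
        char[] password = {'t', 'e', 's', 't', '\u601D', '\u984C'};
        for (char c : password) {
            System.out.println("U+" + Integer.toHexString(c).toUpperCase()
                + " -> " + (isCjk(c) ? "CJK" : "other"));
        }
    }
}
```

Script membership is not the same thing as width: halfwidth katakana such as U+FF71 is still reported as KATAKANA, so this check answers "is it CJK?" rather than "is it full width?".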

I may be completely barking up the wrong tree here though, so let me know if this is what you're after.

EDIT: actually, now I'm not sure that the Java encoding is entirely abstract, after looking at trashgod's link. The char comparisons may still be a good way to go, though, as there are definitions of full-width hex codes in the character documentation.

You appear to be talking about the number of bits in the internal representation of a character, as opposed to the "visible width" referred to in another answer.

The Character class and the char primitive type in Java both use standard Unicode, which handles Latin, Chinese, and many other scripts. Some Unicode characters fit in 16 bits; others need more.

So I think the answer to your question is: go ahead and analyze however you want -- your array contains some 16-bit values and probably some values greater than 16 bits. Without knowing more about what you want to do with the characters, it is hard to be any more explicit.

EDIT: my mistake, the char primitive only handles 16-bit Unicode values. A Character object wraps a single char, so it has the same limit; a code point above U+FFFF is stored as a surrogate pair of two chars, or as a single int code point.
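A small sketch of what the 16-bit limit means in practice, using only `java.lang.Character` methods. The class name `SurrogateDemo` and the sample code point U+20000 (the first character of CJK Unified Ideographs Extension B) are choices made for this example.

```java
class SurrogateDemo {
    // Number of Unicode code points in the array; a valid surrogate
    // pair counts as one code point even though it occupies two chars.
    static int countCodePoints(char[] chars) {
        return Character.codePointCount(chars, 0, chars.length);
    }

    public static void main(String[] args) {
        // Both characters fit in a single 16-bit char each.
        char[] bmp = {'t', '\u601D'};
        // U+20000 lies above U+FFFF, so toChars returns a surrogate pair.
        char[] supplementary = Character.toChars(0x20000);

        System.out.println(bmp.length + " chars, "
            + countCodePoints(bmp) + " code points");
        System.out.println(supplementary.length + " chars, "
            + countCodePoints(supplementary) + " code point");
        System.out.println("first char is a high surrogate: "
            + Character.isHighSurrogate(supplementary[0]));
    }
}
```

If the input can contain such characters, any width check would have to work on int code points rather than on individual chars.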

The width of an East Asian character is described in Unicode Standard Annex #11 (UAX #11), which defines the East_Asian_Width property of a Unicode character.

Although I could find no way of querying this property using the standard Java 8 libraries, one can use the ICU4J library (com.ibm.icu:icu4j in Maven) to get this value.

For example, the following code returns UCharacter.EastAsianWidth.WIDE :

int esw = UCharacter.getIntPropertyValue('あ', UProperty.EAST_ASIAN_WIDTH);

Some testing with Japanese characters has shown that all single-byte Shift JIS kana characters (i.e. halfwidth kana) are designated HALFWIDTH, while their fullwidth counterparts are designated FULLWIDTH. All other fullwidth characters, such as あいうえお, return WIDE, and non-fullwidth characters such as plain Abc return NARROW.

The value AMBIGUOUS needs some extra care because the width of such characters varies with context. For instance, the vim editor has an ambiwidth option that lets the user choose whether they should be treated as narrow or wide, since rendering is terminal dependent.

The aforementioned annex states for ambiguous characters : Ambiguous characters occur in East Asian legacy character sets as wide characters, but as narrow (ie, normal-width) characters in non-East Asian usage.

It also states for NEUTRAL : Strictly speaking, it makes no sense to talk of narrow and wide for neutral characters, but because for all practical purposes they behave like Na, they are treated as narrow characters (the same as Na) under the recommendations below.

However, I have found that NEUTRAL characters are not always narrow in practice, as some of them appear wide in the editors I have tried. Furthermore, some characters are reported as AMBIGUOUS while adjacent characters are NEUTRAL, which doesn't seem to make sense. Perhaps characters not mapped in icu4j fall back to NEUTRAL.

Lastly, UCharacter.EastAsianWidth.COUNT is just a constant representing the number of properties defined under UCharacter.EastAsianWidth , and not a value getIntPropertyValue() will return.

It really depends on how you define a full-width character. The internal representation of a Java String is UTF-16, so each char is a 16-bit value ranging from 0 to 2^16 - 1. If you use the Unicode definition of full width, you can just check whether the char falls within Unicode's block of full-width characters. But that block does not include some common characters in Chinese text, such as ‵。
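A rough sketch of that block check using `Character.UnicodeBlock` from the standard library. The class and method names are made up for this example, and the sample characters are written as escapes: U+FF21 is the fullwidth Latin capital A and U+3002 is the ideographic full stop 。.

```java
class BlockCheck {
    // True if c lies in the "Halfwidth and Fullwidth Forms" block.
    // UnicodeBlock.of returns null for chars outside any block, which
    // the == comparison handles without a null check.
    static boolean inHalfwidthAndFullwidthForms(char c) {
        return Character.UnicodeBlock.of(c)
            == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS;
    }

    public static void main(String[] args) {
        char fullwidthA = '\uFF21';       // fullwidth Latin capital A
        char ideographicStop = '\u3002';  // ideographic full stop

        System.out.println(inHalfwidthAndFullwidthForms(fullwidthA));
        System.out.println(inHalfwidthAndFullwidthForms(ideographicStop));
        // The full stop is reported in a different block.
        System.out.println(Character.UnicodeBlock.of(ideographicStop));
    }
}
```

The block also contains the halfwidth forms, so membership alone does not say which of the two widths a character has; it only narrows the candidates.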
