
How can I determine the width of a Unicode character?

A friend and I are programming our own console in Java, but we have trouble aligning the lines correctly because the display width of Unicode characters cannot be determined exactly. As a result, not only the line containing the Unicode character but also the following lines are shifted.

Is there a way to determine the display width of a Unicode character?

Screenshots of the problem can be found below.

This is how it should look: https://abload.de/img/richtigslkmg.jpeg

This is an example in Terminal: https://abload.de/img/terminal7dj5o.jpeg

This is an example in PowerShell: https://abload.de/img/powershelln7je0.jpeg

This is an example in Visual Studio Code: https://abload.de/img/visualstudiocode4xkuo.jpeg

This is an example in Putty: https://abload.de/img/putty0ujsk.png

EDIT:

I am sorry that the question was unclear.

The question is about display width. In the example I try to determine the display width of each string so that every line has the same length. The function real_length is meant to calculate and return that display width.

Here is the example code:

public static void main(String[] args) {
    String[] tests = {
        "Peter",
        "SHGAMI",
        "Marcel №1",
        "💏",
        "👨‍❤️‍👨",
        "👩‍❤️‍💋‍👩",
        "👨‍👩‍👦"
    };
    for(String test : tests) test(test);
}

public static void test(String text) {
    int max = 20;
    for(int i = 0; i < max;i++) System.out.print("#");
    System.out.println();
    System.out.print(text);
    int length = real_length(text);
    for(int i = 0; i < max - length;i++) System.out.print("#");
    System.out.println();
}

public static int real_length(String text) {
    return text.length();
}

tl;dr

Use code points rather than char. Avoid calling String#length.

input 
+ 
"#".repeat( targetLength - input.codePoints().toArray().length ) 

Details

Your Question neglected to show any code. So I can only guess what you are doing and what might be the problem.

Avoid char

I am guessing that your goal is to append a certain number of NUMBER SIGN characters as needed to make a fixed-length row of text.

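I am guessing the problem is that you are using the legacy char type, or its wrapper class Character. The char type has been essentially broken since Java 2. As a 16-bit value, a single char cannot represent any character outside the Basic Multilingual Plane, which includes most emoji; such characters take two char values (a surrogate pair), so String#length counts them twice.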
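
To make the difference concrete, here is a minimal sketch comparing what String#length reports with the number of code points. The class and method names (CharVersusCodePoint, utf16Units, codePoints) are made up for illustration; only String#length and String#codePointCount come from the JDK.

```java
final class CharVersusCodePoint {

    // Number of UTF-16 code units, which is what String#length reports.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F637 FACE WITH MEDICAL MASK, written as its surrogate pair.
        String mask = "\uD83D\uDE37";
        System.out.println("char count       = " + utf16Units(mask));
        System.out.println("code point count = " + codePoints(mask));
    }
}
```

Neither number is a display width; the code point count is only a better starting point than the char count.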

Use code point numbers

Instead, use code point integer numbers when working with individual characters. A code point is the number permanently assigned to each of the over 140,000 characters defined in Unicode.

A variety of code point related methods have been added to various classes in Java 5+: String, StringBuilder, Character, etc.

Here we use String#codePoints to get an IntStream of code points, one element for each character in the source. And we use StringBuilder#appendCodePoint to collect the code points for our final result string.

final int targetLength = 10;
final int fillerCodePoint = "#".codePointAt( 0 ); // Annoying zero-based index counting.
String input = "😷🤠🤡";

int[] codePoints = input.codePoints().toArray();
StringBuilder stringBuilder = new StringBuilder();
for ( int index = 0 ; index < targetLength ; index++ )
{
    if ( index < codePoints.length )
    {
        stringBuilder.appendCodePoint( codePoints[ index ] );
    } else
    {
        stringBuilder.appendCodePoint( fillerCodePoint );
    }
}

Or, shorten that for loop with the use of a ternary operator.

for ( int index = 0 ; index < targetLength ; index++ )
{
    int codePoint = ( index < codePoints.length ) ? codePoints[ index ] : fillerCodePoint;
    stringBuilder.appendCodePoint( codePoint );
}

Report result.

// Requires: import java.util.Arrays;
System.out.println( Arrays.toString( codePoints ) );
String output = stringBuilder.toString();
System.out.println( "output = " + output );

[128567, 129312, 129313]

output = 😷🤠🤡#######


There is likely a clever way to write that code more briefly with streams and lambdas, but I cannot think of one at the moment.

And, one could cleverly use the String#repeat method in Java 11+.

String output = input + "#".repeat( targetLength - input.codePoints().toArray().length ) ;
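
One caveat with the one-liner: String#repeat throws IllegalArgumentException when the count is negative, which happens as soon as the input has more code points than the target length. A hedged sketch of a small helper that guards against that follows; the names CodePointPadding and padRight are invented for this example, and it uses String#codePointCount instead of materializing an int array. It requires Java 11+ for String#repeat.

```java
final class CodePointPadding {

    // Pads 'input' on the right with 'filler' until it holds 'targetLength'
    // code points. Inputs that are already long enough are returned unchanged.
    static String padRight(String input, int targetLength, String filler) {
        int count = input.codePointCount(0, input.length());
        int missing = Math.max(0, targetLength - count); // never negative
        return input + filler.repeat(missing);
    }

    public static void main(String[] args) {
        // U+1F637, written as its surrogate pair so the source stays ASCII.
        System.out.println(padRight("\uD83D\uDE37", 10, "#"));
        System.out.println(padRight("Peter", 10, "#"));
        System.out.println(padRight("A rather long name", 10, "#"));
    }
}
```

This still counts code points, not display columns, so lines containing wide or combined characters can remain misaligned on a real console.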

Unfortunately there is no easy solution to your deceptively simple question, for several reasons:

  • The width of the characters being rendered on the console might (and probably will) vary, based on the font being used. So the code would need to determine, or assume, the target font in order to calculate widths.

  • System.out is just a PrintStream that does not know or care about fonts and character width, so any solution has to be independent of that.

  • Even if you could determine the font being used on the console, and you had a way to determine the width of each character you were trying to render in that specific font, how would that help you? Knowing the variation in widths might conceivably allow you to cleverly tweak the lines being rendered so that they were aligned, but it's just as likely that it wouldn't be practicable.

  • A potential solution is to leave your code as it stands, and use a monospaced font on the console that println() is writing to, but there are still some major problems with that approach. First, you need to identify a font that is monospaced, but will also support all of the characters you want to render. This can be problematic when including emojis. Second, even if you identify such a font, you may find that all the glyphs for that font are not monospaced! Such a font will ensure that (say) a lowercase i and an uppercase W have the same width, but you can't also make that assumption for emojis, and you can't even assume that the "monospaced" emojis will all have the same non-standard width! Third, the font you identify (if it exists at all) would have to be available in your target environments (your PowerShell, your friend's PuTTY shell, etc.). That is not a major obstacle, but it is one more thing to worry about.

  • You may find that the rendered text varies by operating system. Your output may look aligned in a Linux terminal window, but that same output, using the same font, might be misaligned in a PowerShell window.

Given all that, a better approach might be to use Swing or JavaFX, where you have finer control over the output being rendered. Even if you are unfamiliar with those technologies, it wouldn't take too long to get something working, just by tweaking some sample code obtained through a search. And even allowing for the learning curve, it would still take less time than coming up with a robust solution for aligning arbitrary characters written to an arbitrary console, because that is a hard problem to solve.

Notes:

Sounds like you're looking for a Java implementation of the POSIX wcwidth and wcswidth functions, whose common implementations follow the rules defined in Unicode Technical Report #11 (now published as Unicode Standard Annex #11, East Asian Width), which focuses on display widths for Unicode codepoints when rendered to fixed width devices such as terminals. The only such Java implementation that I'm aware of is in the JLine3 library, which is a lot of code to bring in for just this one class, but that may be your best bet.

Note however that that code appears to be incomplete. Unicode codepoint 0x26AA (⚪️), for example, is reported as having a width of 1 by the JLine3 code, but on every platform I've tested on (including here in the StackOverflow editor, which is a fixed width "device") that codepoint is displayed over two columns.

Good luck - this stuff is a lot more complex than it looks. The JVM's unfortunate UCS-2 history (not Sun's fault - it was bad timing wrt the Unicode standard) only makes matters worse, and as others have said here, avoid the char and Character data types like the plague - they do not work the way you expect, and the instant code that uses those types encounters data including codepoints from the Unicode supplemental planes, it is almost certain to function incorrectly (unless the author has been especially careful - do you feel lucky? 😉).
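
To give a feel for what a wcwidth-style function does, here is a rough, deliberately incomplete sketch. It is not the JLine3 code and not a faithful port of any wcwidth implementation: the class name DisplayWidthSketch, its methods, and the hard-coded "wide" ranges are assumptions chosen for illustration. Combining marks and format characters (including the zero-width joiner) count as 0 columns, a handful of East Asian and emoji ranges count as 2, and everything else counts as 1.

```java
final class DisplayWidthSketch {

    // Approximate column width of a single code point.
    static int codePointWidth(int cp) {
        int type = Character.getType(cp);
        // Combining marks, format characters (e.g. U+200D ZERO WIDTH JOINER)
        // and control characters are treated as taking no columns here.
        if (type == Character.NON_SPACING_MARK
                || type == Character.ENCLOSING_MARK
                || type == Character.FORMAT
                || type == Character.CONTROL) {
            return 0;
        }
        return isWide(cp) ? 2 : 1;
    }

    // Hand-picked, incomplete set of ranges assumed to be double-width.
    static boolean isWide(int cp) {
        return (cp >= 0x1100 && cp <= 0x115F)     // Hangul Jamo leading consonants
                || (cp >= 0x2E80 && cp <= 0xA4CF) // CJK radicals through Yi
                || (cp >= 0xAC00 && cp <= 0xD7A3) // Hangul syllables
                || (cp >= 0xF900 && cp <= 0xFAFF) // CJK compatibility ideographs
                || (cp >= 0xFF00 && cp <= 0xFF60) // fullwidth forms
                || (cp >= 0x1F300 && cp <= 0x1F64F) // pictographs and emoticons
                || (cp >= 0x1F900 && cp <= 0x1F9FF) // supplemental pictographs
                || (cp >= 0x20000 && cp <= 0x3FFFD); // supplementary ideographs
    }

    // Sum of the per-code-point widths; no grapheme cluster handling.
    static int displayWidth(String text) {
        return text.codePoints().map(DisplayWidthSketch::codePointWidth).sum();
    }

    public static void main(String[] args) {
        String[] samples = { "Peter", "Marcel \u21161", "\u65E5\u672C", "\uD83D\uDC8F" };
        for (String sample : samples) {
            System.out.println(displayWidth(sample) + " columns: " + sample);
        }
    }
}
```

Because the sketch sums widths per code point, an emoji sequence joined with zero-width joiners is counted as the sum of its parts, while a terminal may draw it as a single glyph or as separate glyphs. It also reports 1 for U+26AA, just like the behaviour described above, so treat the result as an estimate rather than what any particular console will render.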

Note: This answer is distinct and qualitatively different from my earlier one (which I still stand by).

There is a simple way for a Java console application (i.e. one not using a graphical user interface) to obtain the width of a String being rendered in a given font with a given font size. It requires some AWT classes, which can be used even when the application has no GUI. Here's a demo using the data provided in the question:

package fixedwidth;

import java.awt.Canvas;
import java.awt.Font;
import java.awt.FontMetrics;

public class FixedWidth {

    static String[] tests = {
        "Peter", "SHGAMI", "Marcel №1", "💏", "👨‍❤️‍👨", "👩‍❤️‍💋‍👩", "👨‍👩‍👦"
    };
    static Font smallFont = new Font("Monospaced", Font.PLAIN, 10);
    static Font bigFont = new Font("Monospaced", Font.BOLD, 24);

    /**
     * This code is based on an answer by SO user Lonzak. 
     * See SO Answer https://stackoverflow.com/a/18123024/2985643
     */
    public static void main(String[] args) {
        FontMetrics fm1 = new Canvas().getFontMetrics(FixedWidth.smallFont);
        FixedWidth.demo(tests, fm1);

        FontMetrics fm2 = new Canvas().getFontMetrics(FixedWidth.bigFont);
        FixedWidth.demo(tests, fm2);
    }

    static void demo(String[] tests, FontMetrics fm) {
        Font f = fm.getFont();
        System.out.println("\nFont name:" + f.getName() + ", font size:" + 
                f.getSize() + ", font style:" + f.getStyle());
        for (String test : tests) {
            int width = fm.stringWidth(test);
            System.out.println("width=" + width + ", data=" + test);
        }
    }
}

The code above is based on an old answer by user Lonzak to the question "Java - FontMetrics without Graphics". Those AWT classes allow you to create a Font with defined characteristics (i.e. name, size, style), and then use a FontMetrics instance to obtain the width, in pixels, of an arbitrary String when using that font.

Here is the output from running the code shown above:

Font name:Monospaced, font size:10, font style:0
width=30, data=Peter
width=60, data=SHGAMI
width=59, data=Marcel №1
width=10, data=💏
width=30, data=👨‍❤️‍👨
width=40, data=👩‍❤️‍💋‍👩
width=30, data=👨‍👩‍👦

Font name:Monospaced, font size:24, font style:1
width=70, data=Peter
width=149, data=SHGAMI
width=140, data=Marcel №1
width=25, data=💏
width=73, data=👨‍❤️‍👨
width=98, data=👩‍❤️‍💋‍👩
width=74, data=👨‍👩‍👦

Notes:

  • The first set of results shows the widths of the sample data in the question when using plain Monospaced 10 point font. The second set of results shows the widths of those same strings when using bold Monospaced 24 point font.

  • The widths don't look correct for some of the emojis, but that is because when the source code and output results are pasted into SO some emoji representations are changed, presumably because of the different font being used in the browser. (I was using Monospaced for both the source and the output.) Here's a screen shot of the original output, showing that the widths at least look plausible:

    [Screenshot: IDE output]

  • Even though the widths are being calculated and rendered for a fixed width font (Monospaced), it's clear that the width of the emojis cannot be predicted from the widths of normal keyboard characters.
