简体   繁体   中英

determine whether a unicode character is fullwidth or halfwidth in Perl

How can I determine in Perl, whether a unicode character is a full width (taking two cells; double width) one or half-width (as usual latin characters) one?

Eg Emoji-s are of double width, but there are also characters in lower blocks such as "\N{MEDIUM BLACK CIRCLE}" (U+26ab).

I tried

Unicode::GCString->new("\N{LARGE RED CIRCLE}")->columns()

but it also returns 1.

This is a little messy and I'm hesitant to launch it onto the internet without cleaning it up into a proper library... but I'm unlikely to ever have the time to make that library, so here it is in case it's useful. It's very much derived from Shawn's contribution but instead of using a per-codepoint "cache" that could grow to millions of entries, it uses the Unicode::UCD data to build an "invmap" of codepoint ranges and their associated widths on first call; querying that map works like (and costs the same as or slightly less than) a single charprop call.

map_im takes an invmap as returned by prop_invmap and maps the property values through a hash. Any value not found in the hash will become an undef , which isn't used by Unicode::UCD, but is treated by our code as a "don't care". merge_im takes two such invmaps and merges them such that values in the "right" invmap override values in the "left" invmap, but undef ranges on the right side allow the left-side values to "shine through". charwidth 's state initialization maps and merges three invmaps (those for East_Asian_Width, Category, and a special-case override list) according to the logic from Shawn's own charwidth , and the function simply queries that using Unicode::UCD's own search_invlist routine.

The initialization takes <60ms on my laptop and produces a 909-element invmap (using the UCD from perl 5.32.1), and querying takes about 2.5us per call afterwards.

use Unicode::UCD;
use open qw/:std :locale/;
use charnames qw/:full/;
use feature 'state';
use List::Util 'reduce';

sub map_im {
  my ($im, $h) = @_;
  die unless $im->[2] eq 's';

  my $out;

  for my $i (0 .. $#{ $im->[0] }) {
    my $val = $h->{ $im->[1][$i] };

    my $different = @{ $out->[0] } ? ($val ne $out->[1][-1]) : defined($val);
    if ($different) {
      push @{ $out->[0] }, $im->[0][$i];
      push @{ $out->[1] }, $val;
    }
  }

  # Remove "don't care" entries at the beginning
  while (!defined $out->[1][0]) {
    shift @{ $out->[0] };
    shift @{ $out->[1] };
  }

  return $out;
}

sub merge_im {
  my ($l, $r) = @_;
  die unless $l->[0][0] == 0;

  my $out;
  my $idx_l = my $idx_r = 0;
  my $val_l = $l->[1][0];
  my $val_r;

  while ($idx_r < @{ $r->[0] } || $idx_l < @{ $l->[0] }) {
    my $newcp;
    # Take from the list with the lower next entry. Or the one with entries left.
    # This could probably be simplified.
    if ($idx_r >= @{ $r->[0] } || ($idx_l < @{ $l->[0] } 
        && $l->[0][$idx_l] <= $r->[0][$idx_r])) {
      $newcp = $l->[0][$idx_l];
      $val_l = $l->[1][$idx_l];
      $idx_l ++;
    } else {
      $newcp = $r->[0][$idx_r];
      $val_r = $r->[1][$idx_r];
      $idx_r ++;
    }

    # But if they both have a transition at the same codepoint, take both so there's
    # not a duplicate.
    if ($idx_r < @{ $r->[0] } && $r->[0][$idx_r] == $newcp) {
      $val_r = $r->[1][$idx_r];
      $idx_r ++;
    }

    my $newval = defined($val_r) ? $val_r : $val_l;
    # This gets skipped if we updated $val_l but $val_r is overriding, or
    # $val_r went from undef to equaling $val_l.
    if ($newval ne $out->[1][-1]) {
      push @{ $out->[0] }, $newcp;
      push @{ $out->[1] }, $newval;
    }
  }

  return $out;
}

sub charwidth {
  state $width_eaw = map_im([Unicode::UCD::prop_invmap('East_Asian_Width')], 
    { F => 2, W => 2, H => 1, Na => 1, Neutral => 1, A => 1 }
  );

  state $width_cat = map_im([Unicode::UCD::prop_invmap('Category')],
    { Cc => 0, Mn => 0, Me => 0, Cf => 0 }
  );

  state $width_override = [
    [ 0x0000, 0x0001, # NUL
      0x00AD, 0x00AE, # Soft Hyphen
      0x1160, 0x1200, # Hangul Jamo vowels and final consonants
      0x200B, 0x200C, # ZWSP
    ],
    [ 0, undef,
      1, undef,
      0, undef,
      0, undef,
    ],
  ];

  state $merged = reduce { merge_im($a, $b) } $width_eaw, $width_cat, $width_override;

  my $cp = shift;
  my $idx = Unicode::UCD::search_invlist($merged->[0], $cp);
  return $merged->[1][$idx];
}

sub testwidth($) {
  my $char = shift;
  my $cp = ord $char;
  printf "Width of %c (U+%04X %s) is %d\n", $cp, $cp,
     charnames::viacode($cp), charwidth($cp);
}

testwidth "\x04";
testwidth "a";
testwidth "\N{MEDIUM BLACK CIRCLE}";
testwidth "\N{LARGE RED CIRCLE}";
testwidth "\N{U+20A9}";
testwidth "\N{U+1F637}";

I have some C++ code lying around to calculate character widths. So, a quick conversion to perl later, and...

#!/usr/bin/env perl
use warnings;
use strict;
use feature qw/state/;
use open qw/:std :locale/;
use charnames qw/:full/;
use Unicode::UCD qw/charinfo charprop/;

# Return the number of fixed-width columns taken up by a unicode codepoint
# Inspired by https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
# First adapted to use C++/ICU functions and then to perl
sub charwidth ($) {
  state %cache;

  my $cp = shift; # Numeric codepoint
  return $cache{$cp} if exists $cache{$cp};

  if ($cp == 0 || $cp == 0x200B) {
    # nul and ZERO WIDTH SPACE
    $cache{$cp} = 0;
    return 0;
  } elsif ($cp >= 0x1160 && $cp <= 0x11FF) {
    # Hangul Jamo vowels and final consonants
    $cache{$cp} = 0;
    return 0;
  } elsif ($cp == 0xAD) {
    # SOFT HYPHEN
    $cache{$cp} = 1;
    return 1;
  }

  my $ci = charinfo($cp);
  return undef unless defined $ci;

  my $type = $ci->{category};
  if ($type eq "Cc" || $type eq "Mn" || $type eq "Me" || $type eq "Cf") {
    # Control Code, Non Spacing Mark, Enclosing Mark, Format Char
    $cache{$cp} = 0;
    return 0;
  }

  state $widths = { Fullwidth => 2, Wide => 2, Halfwidth => 1, Narrow => 1,
                    Neutral => 1, Ambiguous => 1 };
  my $eaw = charprop($cp, "East_Asian_Width");
  my $width = $widths->{$eaw} // 1;
  $cache{$cp} = $width;
  return $width;
}

sub testwidth ($) {
  my $char = shift;
  my $cp = ord $char;
  printf "Width of %c (U+%04X %s) is %d\n", $cp, $cp, charnames::viacode($cp),
    charwidth($cp);
}

testwidth "\x04";
testwidth "a";
testwidth "\N{MEDIUM BLACK CIRCLE}";
testwidth "\N{LARGE RED CIRCLE}";
testwidth "\N{U+20A9}";
testwidth "\N{U+1F637}";

Example use:

$ ./charwidths.pl
Width of  (U+0004 END OF TRANSMISSION) is 0
Width of a (U+0061 LATIN SMALL LETTER A) is 1
Width of ⚫ (U+26AB MEDIUM BLACK CIRCLE) is 2
Width of 🔴 (U+1F534 LARGE RED CIRCLE) is 2
Width of ₩ (U+20A9 WON SIGN) is 1
Width of 😷 (U+1F637 FACE WITH MEDICAL MASK) is 2

It just does some special-case checks of particular ranges and categories of codepoints, and then uses the East Asian Width property along with recommendations from TR11 to determine width of everything else.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM