簡體   English   中英

確定 unicode 字符在 Perl 中是全角還是半角

[英]determine whether a unicode character is fullwidth or halfwidth in Perl

如何在 Perl 中確定 unicode 字符是全角(占用兩個單元格;雙倍寬度)一個還是半角(像往常一樣的拉丁字符)一個?

例如,表情符號是雙倍寬度的,但在較低的塊中也有字符,例如"\N{MEDIUM BLACK CIRCLE}" (U+26ab)。

我試過了

Unicode::GCString->new("\N{LARGE RED CIRCLE}")->columns()

但它也返回 1。

這有點混亂,我很猶豫是否將其發布到互聯網上而不將其清理到適當的庫中......但我不太可能有時間制作那個庫,所以在這里以防萬一它有用. 它很大程度上源於Shawn 的貢獻,但它沒有使用可能增長到數百萬個條目的每個代碼點“緩存”,而是使用 Unicode::UCD 數據來構建代碼點范圍的“invmap”及其在第一次調用時的相關寬度; 查詢 map 的工作方式類似於(並且成本與單個charprop調用相同或略低)。

map_im采用 prop_invmap 返回的prop_invmap ,並通過 hash 映射屬性值。 在 hash 中找不到的任何值都將變為undef ,Unicode::UCD 不使用它,但我們的代碼將其視為“無關緊要”。 merge_im采用兩個這樣的 invmap 並將它們合並,以便“右側”invmap 中的值覆蓋“左側”invmap 中的值,但右側的 undef 范圍允許左側值“穿透”。 charwidth 's state initialization maps and merges three invmaps (those for East_Asian_Width, Category, and a special-case override list) according to the logic from Shawn's own charwidth , and the function simply queries that using Unicode::UCD's own search_invlist routine.

初始化在我的筆記本電腦上花費了 <60ms 並產生了一個 909 元素的 invmap(使用來自 perl 5.32.1 的 UCD),之后每次調用大約需要 2.5us。

use Unicode::UCD;
use open qw/:std :locale/;
use charnames qw/:full/;
use feature 'state';
use List::Util 'reduce';

sub map_im {
  my ($im, $h) = @_;
  die unless $im->[2] eq 's';

  my $out;

  for my $i (0 .. $#{ $im->[0] }) {
    my $val = $h->{ $im->[1][$i] };

    my $different = @{ $out->[0] } ? ($val ne $out->[1][-1]) : defined($val);
    if ($different) {
      push @{ $out->[0] }, $im->[0][$i];
      push @{ $out->[1] }, $val;
    }
  }

  # Remove "don't care" entries at the beginning
  while (!defined $out->[1][0]) {
    shift @{ $out->[0] };
    shift @{ $out->[1] };
  }

  return $out;
}

sub merge_im {
  my ($l, $r) = @_;
  die unless $l->[0][0] == 0;

  my $out;
  my $idx_l = my $idx_r = 0;
  my $val_l = $l->[1][0];
  my $val_r;

  while ($idx_r < @{ $r->[0] } || $idx_l < @{ $l->[0] }) {
    my $newcp;
    # Take from the list with the lower next entry. Or the one with entries left.
    # This could probably be simplified.
    if ($idx_r >= @{ $r->[0] } || ($idx_l < @{ $l->[0] } 
        && $l->[0][$idx_l] <= $r->[0][$idx_r])) {
      $newcp = $l->[0][$idx_l];
      $val_l = $l->[1][$idx_l];
      $idx_l ++;
    } else {
      $newcp = $r->[0][$idx_r];
      $val_r = $r->[1][$idx_r];
      $idx_r ++;
    }

    # But if they both have a transition at the same codepoint, take both so there's
    # not a duplicate.
    if ($idx_r < @{ $r->[0] } && $r->[0][$idx_r] == $newcp) {
      $val_r = $r->[1][$idx_r];
      $idx_r ++;
    }

    my $newval = defined($val_r) ? $val_r : $val_l;
    # This gets skipped if we updated $val_l but $val_r is overriding, or
    # $val_r went from undef to equaling $val_l.
    if ($newval ne $out->[1][-1]) {
      push @{ $out->[0] }, $newcp;
      push @{ $out->[1] }, $newval;
    }
  }

  return $out;
}

sub charwidth {
  state $width_eaw = map_im([Unicode::UCD::prop_invmap('East_Asian_Width')], 
    { F => 2, W => 2, H => 1, Na => 1, Neutral => 1, A => 1 }
  );

  state $width_cat = map_im([Unicode::UCD::prop_invmap('Category')],
    { Cc => 0, Mn => 0, Me => 0, Cf => 0 }
  );

  state $width_override = [
    [ 0x0000, 0x0001, # NUL
      0x00AD, 0x00AE, # Soft Hyphen
      0x1160, 0x1200, # Hangul Jamo vowels and final consonants
      0x200B, 0x200C, # ZWSP
    ],
    [ 0, undef,
      1, undef,
      0, undef,
      0, undef,
    ],
  ];

  state $merged = reduce { merge_im($a, $b) } $width_eaw, $width_cat, $width_override;

  my $cp = shift;
  my $idx = Unicode::UCD::search_invlist($merged->[0], $cp);
  return $merged->[1][$idx];
}

sub testwidth($) {
  my $char = shift;
  my $cp = ord $char;
  printf "Width of %c (U+%04X %s) is %d\n", $cp, $cp,
     charnames::viacode($cp), charwidth($cp);
}

testwidth "\x04";
testwidth "a";
testwidth "\N{MEDIUM BLACK CIRCLE}";
testwidth "\N{LARGE RED CIRCLE}";
testwidth "\N{U+20A9}";
testwidth "\N{U+1F637}";

我有一些 C++ 代碼用於計算字符寬度。 因此,稍后快速轉換為 perl,然后...

#!/usr/bin/env perl
use warnings;
use strict;
use feature qw/state/;
use open qw/:std :locale/;
use charnames qw/:full/;
use Unicode::UCD qw/charinfo charprop/;

# Return the number of fixed-width columns taken up by a unicode codepoint
# Inspired by https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
# First adapted to use C++/ICU functions and then to perl
sub charwidth ($) {
  state %cache;

  my $cp = shift; # Numeric codepoint
  return $cache{$cp} if exists $cache{$cp};

  if ($cp == 0 || $cp == 0x200B) {
    # nul and ZERO WIDTH SPACE
    $cache{$cp} = 0;
    return 0;
  } elsif ($cp >= 0x1160 && $cp <= 0x11FF) {
    # Hangul Jamo vowels and final consonants
    $cache{$cp} = 0;
    return 0;
  } elsif ($cp == 0xAD) {
    # SOFT HYPHEN
    $cache{$cp} = 1;
    return 1;
  }

  my $ci = charinfo($cp);
  return undef unless defined $ci;

  my $type = $ci->{category};
  if ($type eq "Cc" || $type eq "Mn" || $type eq "Me" || $type eq "Cf") {
    # Control Code, Non Spacing Mark, Enclosing Mark, Format Char
    $cache{$cp} = 0;
    return 0;
  }

  state $widths = { Fullwidth => 2, Wide => 2, Halfwidth => 1, Narrow => 1,
                    Neutral => 1, Ambiguous => 1 };
  my $eaw = charprop($cp, "East_Asian_Width");
  my $width = $widths->{$eaw} // 1;
  $cache{$cp} = $width;
  return $width;
}

sub testwidth ($) {
  my $char = shift;
  my $cp = ord $char;
  printf "Width of %c (U+%04X %s) is %d\n", $cp, $cp, charnames::viacode($cp),
    charwidth($cp);
}

testwidth "\x04";
testwidth "a";
testwidth "\N{MEDIUM BLACK CIRCLE}";
testwidth "\N{LARGE RED CIRCLE}";
testwidth "\N{U+20A9}";
testwidth "\N{U+1F637}";

示例使用:

$ ./charwidths.pl
Width of  (U+0004 END OF TRANSMISSION) is 0
Width of a (U+0061 LATIN SMALL LETTER A) is 1
Width of ⚫ (U+26AB MEDIUM BLACK CIRCLE) is 2
Width of 🔴 (U+1F534 LARGE RED CIRCLE) is 2
Width of ₩ (U+20A9 WON SIGN) is 1
Width of 😷 (U+1F637 FACE WITH MEDICAL MASK) is 2

它只是對特定范圍和代碼點類別進行一些特殊情況檢查,然后使用 East Asian Width 屬性以及TR11的建議來確定其他所有內容的寬度。

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM