简体   繁体   English

在 Ruby 中枚举一个角色的 Unicode 属性?

[英]Enumerate a character's Unicode properties in Ruby?

Is there any way to enumerate all of a character's Unicode properties in Ruby?有没有办法在 Ruby 中枚举一个角色的所有 Unicode 属性? I can use Ruby 1.9's Regexp class to test whether a given character has a particular property (eg, some_char =~ /\p{P}/ to test whether some_char is punctuation, etc.)... but since characters can have multiple properties ( ( , for example, is both punctuation and ASCII, etc.), it would be nice to just be able to get a list of all of a character's properties.我可以使用 Ruby 1.9 的正则表达式 class 来测试给定字符是否具有特定属性(例如, some_char =~ /\p{P}/来测试some_char是否是标点符号等)...但是由于字符可以有多个属性( ( ,例如,既是标点符号是 ASCII 等),如果能够获得一个字符所有属性的列表就很好了。

I could probably do this by hand using unicode_data.txt , or whatever it's called, but this seems like the sort of thing that's probably already been done somewhere.我可能可以使用unicode_data.txt或其他任何名称手动执行此操作,但这似乎已经在某处完成了。 UnicodeUtils doesn't appear to have anything along these lines, and Googling didn't turn up anything obvious. UnicodeUtils似乎没有任何类似的东西,谷歌搜索也没有发现任何明显的东西。 Thanks!谢谢!

You can call out to my uniprops script .您可以调用我的uniprops 脚本

$ uniprops -p delta greek:delta Greek:Delta
    U+1E9F ‹ẟ› \N{ LATIN SMALL LETTER DELTA }:
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    U+03B4 ‹δ› \N{ GREEK SMALL LETTER DELTA }:
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    U+0394 ‹Δ› \N{ GREEK CAPITAL LETTER DELTA }:
        \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}

$ uniprops \# ç π
    U+0023 ‹#› \N{ NUMBER SIGN }:
        \pP \p{Po}
        All Any ASCII Assigned Common Zyyy Po P Gr_Base
           Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn
           Pattern_Syntax PatSyn PosixGraph PosixPrint PosixPunct
           Print Punctuation
    U+00E7 ‹ç› \N{ LATIN SMALL LETTER C WITH CEDILLA }:
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
        All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased
           Cased_Letter LC Changes_When_Casemapped CWCM
           Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll
           L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC
           ID_Start IDS Letter L_ Latin Latn Lowercase_Letter Lower
           Lowercase Print Word XID_Continue XIDC XID_Start XIDS
    U+03C0 ‹π› \N{ GREEK SMALL LETTER PI }:
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
        All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek
           InGreek Cased Cased_Letter LC Changes_When_Casemapped CWCM
           Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll
           L Gr_Base Grapheme_Base Graph GrBase Grek Greek_And_Coptic
           ID_Continue IDC ID_Start IDS Letter L_ Lowercase_Letter
           Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS


$ uniprops -a 'MICRO SIGN'
U+00B5 ‹µ› \N{MICRO SIGN}
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased Cased_Letter LC Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM
       Changes_When_NFKC_Casefolded CWKCF Changes_When_Titlecased CWT Changes_When_Uppercased CWU Common Zyyy Ll L Gr_Base Grapheme_Base
       Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin_1 Latin_1_Supplement Lowercase_Letter Lower Lowercase Print Word
       XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word
    Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Latin_1 Block=Latin_1_Supplement BLK=Latin1 Canonical_Combining_Class=0
       Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=Com
       Decomposition_Type=Compat DT=Com Decomposition_Type=Non_Canon Decomposition_Type=Non_Canonical DT=NonCanon East_Asian_Width=Neutral
       Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA
       Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic
       LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1
       Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0
       Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=LO Sentence_Break=Lower SB=LO
       Word_Break=ALetter WB=LE Word_Break=LE _X_Begin

$ uniprops -a 2011
U+2011 ‹‑› \N{NON-BREAKING HYPHEN}
    \pP \p{Pd}
    All Any Assigned InGeneralPunctuation Changes_When_NFKC_Casefolded CWKCF Common Zyyy Dash Dash_Punctuation Pd P General_Punctuation
       Gr_Base Grapheme_Base Graph GrBase Punct Pat_Syn Pattern_Syntax PatSyn Print Punctuation X_POSIX_Graph X_POSIX_Print X_POSIX_Punct
    Age=1.1 Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=General_Punctuation Canonical_Combining_Class=0
       Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=Nb
       Decomposition_Type=Nobreak DT=Nb Decomposition_Type=Non_Canon Decomposition_Type=Non_Canonical DT=NonCanon East_Asian_Width=Neutral
       Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA
       Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=GL Line_Break=Glue LB=GL
       Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0
       IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1
       IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=Other SB=XX Sentence_Break=XX Word_Break=Other
       WB=XX Word_Break=XX _X_Begin

    $ uniprops -l | grep Greek | sort -dfu
    Blk=Greek
    Block:Ancient_Greek_Musical_Notation
    Block:Ancient_Greek_Numbers
    Block:Greek
    Block=Greek_And_Coptic
    Block:Greek_Extended
    Greek
    Greek_And_Coptic
    InAncientGreekMusicalNotation
    InAncientGreekNumbers
    InGreek
    InGreekExtended
    Is_Greek
    Script=Greek

You probably want to also get unichars so you can go the other way.您可能还想获得unichars ,以便您可以 go 以另一种方式。 Here are just the examples of calling it:这里只是调用它的例子:

 $ unichars -gns '\p{Cased}' '\p{Number}'
 $ unichars '\R'
 $ unichars '\S' '[\v\h]' 
 $ unichars '\S' '\p{space}'   
 $ unichars '\pL' '\p{Greek}'
 $ unichars '\pL' '\p{Greek}' | um
 $ unichars '\p{Age=6.0}'     | um
 $ unichars '\p{Lowercase}' '\P{Lowercase_Letter}' 
 $ unichars '\p{Lower}'     '\P{Ll}'  # same but easier to type
 $ unichars -a '\p{alphabetic}' '\P{Letter}' | wc -l # 1006 code points
 $ unichars -gas '\PL' '\p{Cased}'
 $ unichars -gas '\P{MARK}' '\p{diacritic}'   #  209 code points
 $ unichars -gas '\pM' '\P{BC=NSM}'
 $ unichars -gas '\p{Cased}' '[^\p{CWL}\p{CWT}\p{CWU}]'  
 $ unichars -gas '\p{Dash}'
 $ unichars -gas '\p{mark}' '\P{DIACRITIC}'   # 1068 code points
 $ unichars -gas 'grep { length > 1 } lc, ucfirst, uc'
 $ unichars -gas 'uc ne ucfirst'
 $ unichars -gasn NUM

Here is one example of the output:这是 output 的一个示例:

$ unichars -gsn NUM 'int NUM ne NUM'
‭ 0  U+0030 GC=Nd      0=NV  SC=Common       DIGIT ZERO
‭ ¼  U+00BC GC=No    1/4=NV  SC=Common       VULGAR FRACTION ONE QUARTER
‭ ½  U+00BD GC=No    1/2=NV  SC=Common       VULGAR FRACTION ONE HALF
‭ ¾  U+00BE GC=No    3/4=NV  SC=Common       VULGAR FRACTION THREE QUARTERS
‭ ٠  U+0660 GC=Nd      0=NV  SC=Common       ARABIC-INDIC DIGIT ZERO
‭ ۰  U+06F0 GC=Nd      0=NV  SC=Arabic       EXTENDED ARABIC-INDIC DIGIT ZERO
‭ ߀  U+07C0 GC=Nd      0=NV  SC=Nko          NKO DIGIT ZERO
‭ ०  U+0966 GC=Nd      0=NV  SC=Devanagari   DEVANAGARI DIGIT ZERO
‭ ০  U+09E6 GC=Nd      0=NV  SC=Bengali      BENGALI DIGIT ZERO
‭ ৴  U+09F4 GC=No   1/16=NV  SC=Bengali      BENGALI CURRENCY NUMERATOR ONE
‭ ৵  U+09F5 GC=No    1/8=NV  SC=Bengali      BENGALI CURRENCY NUMERATOR TWO
‭ ৶  U+09F6 GC=No   3/16=NV  SC=Bengali      BENGALI CURRENCY NUMERATOR THREE
‭ ৷  U+09F7 GC=No    1/4=NV  SC=Bengali      BENGALI CURRENCY NUMERATOR FOUR
‭ ৸  U+09F8 GC=No    3/4=NV  SC=Bengali      BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR
‭ ੦  U+0A66 GC=Nd      0=NV  SC=Gurmukhi     GURMUKHI DIGIT ZERO
‭ ૦  U+0AE6 GC=Nd      0=NV  SC=Gujarati     GUJARATI DIGIT ZERO
‭ ୦  U+0B66 GC=Nd      0=NV  SC=Oriya        ORIYA DIGIT ZERO
‭ ୲  U+0B72 GC=No    1/4=NV  SC=Oriya        ORIYA FRACTION ONE QUARTER
‭ ୳  U+0B73 GC=No    1/2=NV  SC=Oriya        ORIYA FRACTION ONE HALF
‭ ୴  U+0B74 GC=No    3/4=NV  SC=Oriya        ORIYA FRACTION THREE QUARTERS
‭ ୵  U+0B75 GC=No   1/16=NV  SC=Oriya        ORIYA FRACTION ONE SIXTEENTH
‭ ୶  U+0B76 GC=No    1/8=NV  SC=Oriya        ORIYA FRACTION ONE EIGHTH
‭ ୷  U+0B77 GC=No   3/16=NV  SC=Oriya        ORIYA FRACTION THREE SIXTEENTHS

etc.等等

I describe these the first of my OSCON Unicode talks .我将在我的OSCON Unicode 会谈中描述这些。 Those are just two of the tools in a suite of a couple of dozen of them.这些只是几十个套件中的两个工具。

There is a unicode_data.txt interface by runpaint , which works well, but describes itself as a "very early draft". runpaint 有一个unicode_data.txt 接口,效果很好,但将自己描述为“非常早期的草案”。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM