Enumerate a character's Unicode properties in Ruby?
Is there any way to enumerate all of a character's Unicode properties in Ruby? I can use Ruby 1.9's Regexp class to test whether a given character has a particular property (e.g., some_char =~ /\p{P}/ to test whether some_char is punctuation), but since characters can have multiple properties ( ( , for example, is both punctuation and ASCII), it would be nice to be able to get a list of all of a character's properties.
I could probably do this by hand using unicode_data.txt, or whatever it's called, but this seems like the sort of thing that has probably already been done somewhere. UnicodeUtils doesn't appear to have anything along these lines, and Googling didn't turn up anything obvious. Thanks!
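The per-property Regexp test described above can be turned into a rough enumeration by probing a character against a list of candidate property names. This is only a minimal sketch: `CANDIDATE_PROPERTIES` and `unicode_properties` are names made up for illustration, the list is far from complete, and names the regex engine does not recognize are simply skipped.

```ruby
# Candidate property names to probe. Illustrative only, not exhaustive;
# extend it with whatever names your Ruby's regex engine accepts.
CANDIDATE_PROPERTIES = %w[
  Alnum Alpha ASCII Digit Graph Lower Print Punct Space Upper Word
  L Ll Lu Lt Lm Lo M N Nd Nl No P Pc Pd Ps Pe Pi Pf Po S Sm Sc Sk So Z Zs
  Latin Greek Cyrillic Common
].freeze

# Returns the candidate names whose \p{...} class matches the character.
# (Hypothetical helper, not part of Ruby or of any gem.)
def unicode_properties(char, candidates = CANDIDATE_PROPERTIES)
  candidates.select do |name|
    begin
      char.match?(Regexp.new("\\p{#{name}}"))
    rescue RegexpError
      false # property name not known to this Ruby's regex engine
    end
  end
end

p unicode_properties("(")
p unicode_properties("\u03B4") # GREEK SMALL LETTER DELTA
```

This only answers "which of the names I thought to ask about apply?", so it is no substitute for a tool that knows the full property list.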
You can call out to my uniprops script.
$ uniprops -p delta greek:delta Greek:Delta
U+1E9F ‹ẟ› \N{ LATIN SMALL LETTER DELTA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
U+03B4 ‹δ› \N{ GREEK SMALL LETTER DELTA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
U+0394 ‹Δ› \N{ GREEK CAPITAL LETTER DELTA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
$ uniprops \# ç π
U+0023 ‹#› \N{ NUMBER SIGN }:
\pP \p{Po}
All Any ASCII Assigned Common Zyyy Po P Gr_Base
Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn
Pattern_Syntax PatSyn PosixGraph PosixPrint PosixPunct
Print Punctuation
U+00E7 ‹ç› \N{ LATIN SMALL LETTER C WITH CEDILLA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased
Cased_Letter LC Changes_When_Casemapped CWCM
Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll
L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC
ID_Start IDS Letter L_ Latin Latn Lowercase_Letter Lower
Lowercase Print Word XID_Continue XIDC XID_Start XIDS
U+03C0 ‹π› \N{ GREEK SMALL LETTER PI }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek
InGreek Cased Cased_Letter LC Changes_When_Casemapped CWCM
Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll
L Gr_Base Grapheme_Base Graph GrBase Grek Greek_And_Coptic
ID_Continue IDC ID_Start IDS Letter L_ Lowercase_Letter
Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS
$ uniprops -a 'MICRO SIGN'
U+00B5 ‹µ› \N{MICRO SIGN}
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased Cased_Letter LC Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM
Changes_When_NFKC_Casefolded CWKCF Changes_When_Titlecased CWT Changes_When_Uppercased CWU Common Zyyy Ll L Gr_Base Grapheme_Base
Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin_1 Latin_1_Supplement Lowercase_Letter Lower Lowercase Print Word
XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word
Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Latin_1 Block=Latin_1_Supplement BLK=Latin1 Canonical_Combining_Class=0
Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=Com
Decomposition_Type=Compat DT=Com Decomposition_Type=Non_Canon Decomposition_Type=Non_Canonical DT=NonCanon East_Asian_Width=Neutral
Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA
Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic
LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1
Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0
Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=LO Sentence_Break=Lower SB=LO
Word_Break=ALetter WB=LE Word_Break=LE _X_Begin
$ uniprops -a 2011
U+2011 ‹‑› \N{NON-BREAKING HYPHEN}
\pP \p{Pd}
All Any Assigned InGeneralPunctuation Changes_When_NFKC_Casefolded CWKCF Common Zyyy Dash Dash_Punctuation Pd P General_Punctuation
Gr_Base Grapheme_Base Graph GrBase Punct Pat_Syn Pattern_Syntax PatSyn Print Punctuation X_POSIX_Graph X_POSIX_Print X_POSIX_Punct
Age=1.1 Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=General_Punctuation Canonical_Combining_Class=0
Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=Nb
Decomposition_Type=Nobreak DT=Nb Decomposition_Type=Non_Canon Decomposition_Type=Non_Canonical DT=NonCanon East_Asian_Width=Neutral
Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA
Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=GL Line_Break=Glue LB=GL
Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0
IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1
IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=Other SB=XX Sentence_Break=XX Word_Break=Other
WB=XX Word_Break=XX _X_Begin
$ uniprops -l | grep Greek | sort -dfu
Blk=Greek
Block:Ancient_Greek_Musical_Notation
Block:Ancient_Greek_Numbers
Block:Greek
Block=Greek_And_Coptic
Block:Greek_Extended
Greek
Greek_And_Coptic
InAncientGreekMusicalNotation
InAncientGreekNumbers
InGreek
InGreekExtended
Is_Greek
Script=Greek
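Since the question is about Ruby, here is one way the output above might be consumed from a Ruby program. It is a sketch under assumptions: the parser is based only on the output format shown in this answer (a `U+XXXX` header line followed by whitespace-separated property names), `UnipropsEntry`, `parse_uniprops` and the `uniprops` wrapper are hypothetical names, and the wrapper assumes the script is installed on your PATH and accepts a literal character together with `-a`.

```ruby
require "open3"

# One entry per character reported by uniprops (hypothetical struct).
UnipropsEntry = Struct.new(:codepoint, :name, :properties)

# Header lines look like:  U+0023 ‹#› \N{ NUMBER SIGN }:
# (the trailing colon and the spaces inside the braces are optional)
UNIPROPS_HEADER = /\AU\+(\h+)\s+\S+\s+\\N\{\s*(.+?)\s*\}:?\s*\z/

# Turns uniprops text output into an array of UnipropsEntry.
def parse_uniprops(text)
  entries = []
  text.each_line do |line|
    line = line.strip
    next if line.empty?
    if (m = UNIPROPS_HEADER.match(line))
      entries << UnipropsEntry.new(m[1].hex, m[2], [])
    elsif entries.any?
      # Any other line is a run of property names for the latest header.
      entries.last.properties.concat(line.split)
    end
  end
  entries
end

# Assumption: `uniprops` is on PATH. Not called in this example.
def uniprops(char)
  output, status = Open3.capture2("uniprops", "-a", char)
  status.success? ? parse_uniprops(output) : []
end

sample = <<~'OUT'
  U+0023 ‹#› \N{ NUMBER SIGN }:
      \pP \p{Po}
      All Any ASCII Assigned Common Zyyy Po P Gr_Base
OUT

entry = parse_uniprops(sample).first
puts format("U+%04X %s: %s", entry.codepoint, entry.name, entry.properties.join(" "))
```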
You will probably also want to get unichars so that you can go the other way. Here are some examples of calling it:
$ unichars -gns '\p{Cased}' '\p{Number}'
$ unichars '\R'
$ unichars '\S' '[\v\h]'
$ unichars '\S' '\p{space}'
$ unichars '\pL' '\p{Greek}'
$ unichars '\pL' '\p{Greek}' | um
$ unichars '\p{Age=6.0}' | um
$ unichars '\p{Lowercase}' '\P{Lowercase_Letter}'
$ unichars '\p{Lower}' '\P{Ll}' # same but easier to type
$ unichars -a '\p{alphabetic}' '\P{Letter}' | wc -l # 1006 code points
$ unichars -gas '\PL' '\p{Cased}'
$ unichars -gas '\P{MARK}' '\p{diacritic}' # 209 code points
$ unichars -gas '\pM' '\P{BC=NSM}'
$ unichars -gas '\p{Cased}' '[^\p{CWL}\p{CWT}\p{CWU}]'
$ unichars -gas '\p{Dash}'
$ unichars -gas '\p{mark}' '\P{DIACRITIC}' # 1068 code points
$ unichars -gas 'grep { length > 1 } lc, ucfirst, uc'
$ unichars -gas 'uc ne ucfirst'
$ unichars -gasn NUM
Here is one example of the output:
$ unichars -gsn NUM 'int NUM ne NUM'
0 U+0030 GC=Nd 0=NV SC=Common DIGIT ZERO
¼ U+00BC GC=No 1/4=NV SC=Common VULGAR FRACTION ONE QUARTER
½ U+00BD GC=No 1/2=NV SC=Common VULGAR FRACTION ONE HALF
¾ U+00BE GC=No 3/4=NV SC=Common VULGAR FRACTION THREE QUARTERS
٠ U+0660 GC=Nd 0=NV SC=Common ARABIC-INDIC DIGIT ZERO
۰ U+06F0 GC=Nd 0=NV SC=Arabic EXTENDED ARABIC-INDIC DIGIT ZERO
߀ U+07C0 GC=Nd 0=NV SC=Nko NKO DIGIT ZERO
० U+0966 GC=Nd 0=NV SC=Devanagari DEVANAGARI DIGIT ZERO
০ U+09E6 GC=Nd 0=NV SC=Bengali BENGALI DIGIT ZERO
৴ U+09F4 GC=No 1/16=NV SC=Bengali BENGALI CURRENCY NUMERATOR ONE
৵ U+09F5 GC=No 1/8=NV SC=Bengali BENGALI CURRENCY NUMERATOR TWO
৶ U+09F6 GC=No 3/16=NV SC=Bengali BENGALI CURRENCY NUMERATOR THREE
৷ U+09F7 GC=No 1/4=NV SC=Bengali BENGALI CURRENCY NUMERATOR FOUR
৸ U+09F8 GC=No 3/4=NV SC=Bengali BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR
੦ U+0A66 GC=Nd 0=NV SC=Gurmukhi GURMUKHI DIGIT ZERO
૦ U+0AE6 GC=Nd 0=NV SC=Gujarati GUJARATI DIGIT ZERO
୦ U+0B66 GC=Nd 0=NV SC=Oriya ORIYA DIGIT ZERO
୲ U+0B72 GC=No 1/4=NV SC=Oriya ORIYA FRACTION ONE QUARTER
୳ U+0B73 GC=No 1/2=NV SC=Oriya ORIYA FRACTION ONE HALF
୴ U+0B74 GC=No 3/4=NV SC=Oriya ORIYA FRACTION THREE QUARTERS
୵ U+0B75 GC=No 1/16=NV SC=Oriya ORIYA FRACTION ONE SIXTEENTH
୶ U+0B76 GC=No 1/8=NV SC=Oriya ORIYA FRACTION ONE EIGHTH
୷ U+0B77 GC=No 3/16=NV SC=Oriya ORIYA FRACTION THREE SIXTEENTHS
etc.
I describe these in the first of my OSCON Unicode talks. Those are just two of the tools in a suite of a couple of dozen.
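If you would rather stay inside Ruby, the "other way" direction (property to characters) can be roughly approximated by scanning a range of code points and keeping those that match every pattern. This is only an illustrative sketch, not how unichars is implemented: `chars_matching` is a made-up helper, and the results depend on the Unicode version your Ruby was built with.

```ruby
# Returns every character in `range` that matches all of the given patterns.
# (Hypothetical helper; scans the Basic Multilingual Plane by default.)
def chars_matching(*patterns, range: 0..0xFFFF)
  range.each_with_object([]) do |cp, out|
    next if (0xD800..0xDFFF).cover?(cp) # surrogates are not valid characters
    ch = cp.chr(Encoding::UTF_8)
    out << ch if patterns.all? { |re| ch.match?(re) }
  end
end

# Lowercase letters in the Greek script, limited to the Greek and Coptic block.
greek_lower = chars_matching(/\p{Ll}/, /\p{Greek}/, range: 0x0370..0x03FF)
puts greek_lower.join(" ")
```

Scanning all of 0..0x10FFFF works the same way but is noticeably slower, so narrowing `range` is usually worthwhile.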
There is a unicode_data.txt interface by runpaint, which works well but describes itself as a "very early draft".
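For a sense of what the by-hand route involves, here is a minimal sketch of reading records from the Unicode Character Database file, which is actually named UnicodeData.txt. Each line holds 15 semicolon-separated fields; the sketch keeps only a few of them. `UnicodeDataRecord` and `parse_unicode_data_line` are names invented for illustration and are not runpaint's API. Note that this file carries the general category, bidi class and similar fields, but not scripts or blocks, which live in other UCD files.

```ruby
# A subset of the fields in one UnicodeData.txt line (hypothetical struct).
UnicodeDataRecord = Struct.new(
  :codepoint, :name, :general_category, :combining_class,
  :bidi_class, :decomposition, :mirrored
)

def parse_unicode_data_line(line)
  fields = line.chomp.split(";", -1) # -1 keeps trailing empty fields
  UnicodeDataRecord.new(
    fields[0].hex,      # code point, hexadecimal
    fields[1],          # character name
    fields[2],          # general category, e.g. "Ps" or "Ll"
    fields[3].to_i,     # canonical combining class
    fields[4],          # bidi class
    fields[5],          # decomposition type and mapping (may be empty)
    fields[9] == "Y"    # bidi mirrored flag
  )
end

# Two lines as they appear in UnicodeData.txt.
SAMPLE_LINES = [
  "0028;LEFT PARENTHESIS;Ps;0;ON;;;;;Y;OPENING PARENTHESIS;;;;",
  "00B5;MICRO SIGN;Ll;0;L;<compat> 03BC;;;;N;;;039C;;039C"
].freeze

SAMPLE_LINES.each { |line| p parse_unicode_data_line(line) }
```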