
What does ($pre =~ /\./ && $pre =~ /\p{IsAlpha}/) mean in the Moses Tokenizer?

Moses Tokenizer is the tokenizer widely used in machine translation and natural language processing experiments.

There is a line of regex that checks for:

if (($pre =~ /\./ && $pre =~ /\p{IsAlpha}/) || 
   ($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1) || 
   ($i<scalar(@words)-1 && ($words[$i+1] =~ /^[\p{IsLower}]/)))

Please correct me if I'm wrong, the 2nd and 3rd conditions are to check

  • whether the prefix is in a list of nonbreaking prefixes
  • whether the word is not the last token and there is still a lowercased token as the next word.

The question is on the first condition where it checks for:

($pre =~ /\./ && $pre =~ /\p{IsAlpha}/)
  1. Is the $pre =~ /\./ checking whether the prefix is a single fullstop?

  2. And is $pre =~ /\p{IsAlpha}/ checking whether the prefix is an alpha from the list of alphabet in the perluniprop?

  3. One related question is whether the fullstop is already inside the perluniprop alphabet? If so, wouldn't this condition never be true?

Please correct me if I'm wrong [about $NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1 checking] whether the prefix is in a list of nonbreaking prefixes

Can't tell without knowing what %NONBREAKING_PREFIX contains, but it's a fair guess.

Please correct me if I'm wrong [about $i<scalar(@words)-1 && ($words[$i+1] =~ /^[\p{IsLower}]/) checking] whether the word is not the last token and there is still a lowercased token as the next word

Assuming the code is iterating over @words, and $i is the index of the current word, then it checks if the current word is followed by a word that starts with a lowercase letter (as defined by Unicode).
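A minimal sketch of that third condition in isolation, under the same assumption about @words and $i. The helper name followed_by_lowercase and the sample tokens are made up for illustration; they are not part of the tokenizer.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $words[$i] is not the last element and
# the element after it starts with a lowercase character.
sub followed_by_lowercase {
    my ($i, @words) = @_;
    return ($i < scalar(@words) - 1 && $words[$i + 1] =~ /^[\p{IsLower}]/) ? 1 : 0;
}

# Made-up sample tokens, not taken from the tokenizer.
my @words = ('approx.', 'five', 'km.', 'Then');
for my $i (0 .. $#words) {
    printf "%-8s %d\n", $words[$i], followed_by_lowercase($i, @words);
}
```

The short-circuiting && matters here: $words[$i+1] is only looked at when $i is not the last index, so the check never reads past the end of the array.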

Is the $pre =~ /\./ checking whether the prefix is a single fullstop?

Not quite. It checks if any of the characters in the string in $pre is a FULL STOP.

$ perl -e'CORE::say "abc.def" =~ /\./ ? "match" : "no match"'
match

$ perl -e'CORE::say "abc!def" =~ /\./ ? "match" : "no match"'
no match

Perl first tries to find a match at position 0, then at position 1, etc, until it finds a match.
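To check for a string that is exactly one full stop, the pattern would need anchors. Here is a small sketch contrasting the two; the sub names and sample strings are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

# Unanchored: true if a full stop occurs anywhere in the string.
sub contains_full_stop {
    my ($s) = @_;
    return $s =~ /\./ ? 1 : 0;
}

# Anchored: true only if the whole string is a single full stop.
sub is_single_full_stop {
    my ($s) = @_;
    return $s =~ /\A\.\z/ ? 1 : 0;
}

for my $s ('.', 'abc.def', 'abc') {
    printf "%-8s contains=%d single=%d\n",
        $s, contains_full_stop($s), is_single_full_stop($s);
}
```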

And is $pre =~ /\p{IsAlpha}/ checking whether the prefix is an alpha from the list of alphabet in the perluniprop?

\p{IsAlpha} is indeed defined in perluniprops. [Note the correct spelling.] It defines

\p{Is_*}          ⇒   \p{*}
\p{Alpha}         ⇒   \p{XPosixAlpha}
\p{XPosixAlpha}   ⇒   \p{Alphabetic=Y}

\p{Alpha: *}      ⇒   \p{Alphabetic=*}
\p{Alphabetic}    ⇒   \p{Alphabetic=Y}

so \p{IsAlpha} is an alias for \p{Alphabetic=Y} [1]. Unicode defines which characters are Alphabetic [2]. There are quite a few:

$ unichars '\p{Alpha}' | wc -l
10391

So back to the question. $pre =~ /\p{IsAlpha}/ checks if any of the characters in the string in $pre is an alphabetic character.

One related question is whether the fullstop is already inside the perluniprop alphabet?

No.

$ perl -e'CORE::say "." =~ /\p{IsAlpha}/ ? "match" : "no match"'
no match

$ uniprops .
U+002E <.> \N{FULL STOP}
    \pP \p{Po}
    All Any ASCII Assigned Basic_Latin Punct Is_Punctuation Case_Ignorable CI Common Zyyy Po P
       Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Other_Punctuation Pat_Syn Pattern_Syntax
       PatSyn POSIX_Graph POSIX_Print POSIX_Punct Print X_POSIX_Print Punctuation STerm Term
       Terminal_Punctuation Unicode X_POSIX_Punct

In contrast,

$ uniprops a
U+0061 <a> \N{LATIN SMALL LETTER A}
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    AHex POSIX_XDigit All Alnum X_POSIX_Alnum Alpha X_POSIX_Alpha Alphabetic Any ASCII
       ASCII_Hex_Digit Assigned Basic_Latin ID_Continue Is_IDC Cased Cased_Letter LC
       Changes_When_Casemapped CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L
       Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Hex X_POSIX_XDigit Hex_Digit IDC ID_Start
       IDS Letter L_ Latin Latn Lowercase_Letter Lower X_POSIX_Lower Lowercase PerlWord POSIX_Word
       POSIX_Alnum POSIX_Alpha POSIX_Graph POSIX_Lower POSIX_Print Print X_POSIX_Print Unicode Word
       X_POSIX_Word XDigit XID_Continue XIDC XID_Start XIDS

If so, wouldn't this condition never be true?

$ perl -E'CORE::say /\./ && /\p{IsAlpha}/ ? "match" : "no match" for $ARGV[0]' a
no match

$ perl -E'CORE::say /\./ && /\p{IsAlpha}/ ? "match" : "no match" for $ARGV[0]' .
no match

$ perl -E'CORE::say /\./ && /\p{IsAlpha}/ ? "match" : "no match" for $ARGV[0]' a.
match
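So the condition can be true: the two patterns are unanchored and may match different characters of the same string. A rough sketch that wraps the condition in a sub and runs it over a few strings follows. It assumes $pre holds a word with its trailing period already stripped (so "U.S." would arrive as "U.S"), which the quoted snippet does not show; the sub name and the samples are made up.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the first condition from the question:
# at least one full stop AND at least one alphabetic character,
# anywhere in the string and in any order.
sub has_stop_and_alpha {
    my ($pre) = @_;
    return ($pre =~ /\./ && $pre =~ /\p{IsAlpha}/) ? 1 : 0;
}

# Made-up samples: only strings with both a "." and a letter match.
for my $pre ('a', '.', 'a.', 'U.S', 'e.g', '3.14') {
    printf "%-5s %s\n", $pre, has_stop_and_alpha($pre) ? 'match' : 'no match';
}
```

Read this way, the condition looks like a check for abbreviation-style strings with an internal period, though that is an interpretation rather than something the snippet states.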

  1. Underscores and spaces are ignored, so \p{IsAlpha}, \p{Is_Alpha} and \p{I s_A l p_h_a} are all equivalent.

  2. The list of alphabetic characters is slightly different than the list of letter characters.

     $ unichars '\p{Letter}' | wc -l
     9540

     $ unichars '\p{Alpha}' | wc -l
     10391

    All letters are alphabetic, but so are some alphabetic marks, roman numerals, etc.
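As a concrete instance of that difference, here is a short sketch using U+2167 ROMAN NUMERAL EIGHT, whose general category is Nl (Letter_Number): it is Alphabetic but not a Letter. The sub names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two property tests.
sub is_alpha  { my ($c) = @_; return $c =~ /\p{Alpha}/  ? 1 : 0; }
sub is_letter { my ($c) = @_; return $c =~ /\p{Letter}/ ? 1 : 0; }

my %samples = (
    'U+0061 LATIN SMALL LETTER A' => "a",
    'U+2167 ROMAN NUMERAL EIGHT'  => "\x{2167}",
    'U+002E FULL STOP'            => ".",
);

for my $name (sort keys %samples) {
    my $c = $samples{$name};
    printf "%-28s alpha=%d letter=%d\n", $name, is_alpha($c), is_letter($c);
}
```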
