Matching Unicode word boundaries in Python

To match Unicode word boundaries as defined in Unicode Standard Annex #29 in Python, I have been using the regex package with the flags regex.WORD | regex.V1 (regex.UNICODE should be the default, since the pattern is a Unicode string), in the following way:

>>> import regex
>>> s = "here are some words"
>>> regex.findall(r'\w(?:\B\S)*', s, flags = regex.V1 | regex.WORD)
['here', 'are', 'some', 'words']

This works well in such simple cases. However, I was wondering what the expected behavior is when the input string contains certain punctuation. It seems to me that rule WB7 says that, for example, the apostrophe in x'z does not qualify as a word boundary, which indeed seems to be the case:

>>> regex.findall(r'\w(?:\B\S)*', "x'z", flags = regex.V1 | regex.WORD)
["x'z"]

However, if there is a vowel, the situation changes:

>>> regex.findall(r'\w(?:\B\S)*', "l'avion", flags = regex.V1 | regex.WORD)
["l'", 'avion']

This would suggest that the regex module implements rule WB5a, which is mentioned in the Notes section of the standard. However, that rule also says the behavior should be the same with ’ (U+2019 RIGHT SINGLE QUOTATION MARK), which I can't reproduce:

>>> regex.findall(r'\w(?:\B\S)*', "l\u2019avion", flags = regex.V1 | regex.WORD)
['l’avion']

Moreover, even with the "normal" apostrophe, a ligature (or y) seems to behave as a "non-vowel":

>>> regex.findall(r'\w(?:\B\S)*', "l'œil", flags = regex.V1 | regex.WORD)
["l'œil"]
>>> regex.findall(r'\w(?:\B\S)*', "J'y suis", flags = regex.V1 | regex.WORD)
["J'y", 'suis']

Is this the expected behavior? (All examples above were executed with regex 2.4.106 and Python 3.5.2.)

1- RIGHT SINGLE QUOTATION MARK ’ (U+2019) simply seems to be missing from the source file:

/* Break between apostrophe and vowels (French, Italian). */
/* WB5a */
if (pos_m1 >= 0 && char_at(state->text, pos_m1) == '\'' &&
  is_unicode_vowel(char_at(state->text, text_pos)))
    return TRUE;
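To make the effect of that condition concrete, here is a rough Python sketch of the same check. It is not the module's code: the function name breaks_after_apostrophe, the apostrophes parameter and the ASCII-only vowel set are all hypothetical simplifications introduced for illustration. With only ' accepted it mirrors the C condition above; passing both apostrophes shows what the WB5a note would suggest for ’.

```python
# Simplified stand-in for the vowel test (ASCII only, for illustration).
ASCII_VOWELS = frozenset("aeiou")


def breaks_after_apostrophe(text, pos, apostrophes=("'",)):
    """Return True if a WB5a-style break would be reported at index pos.

    Mirrors the C condition: the previous character is an apostrophe
    and the character at pos is a vowel.
    """
    return (
        1 <= pos < len(text)
        and text[pos - 1] in apostrophes
        and text[pos].lower() in ASCII_VOWELS
    )


# Only the ASCII apostrophe is checked, as in the C snippet above.
print(breaks_after_apostrophe("l'avion", 2))        # True
print(breaks_after_apostrophe("l\u2019avion", 2))   # False

# Hypothetical variant that also accepts U+2019.
both = ("'", "\u2019")
print(breaks_after_apostrophe("l\u2019avion", 2, apostrophes=both))  # True
```

This matches what the question observed: "l'avion" is split after the apostrophe, while "l’avion" is not.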

2- Unicode vowels are determined by the is_unicode_vowel() function, which amounts to this list:

a, à, á, â, e, è, é, ê, i, ì, í, î, o, ò, ó, ô, u, ù, ú, û

So LATIN SMALL LIGATURE OE (œ) is not considered a Unicode vowel:

Py_LOCAL_INLINE(BOOL) is_unicode_vowel(Py_UCS4 ch) {
#if PY_VERSION_HEX >= 0x03030000
    switch (Py_UNICODE_TOLOWER(ch)) {
#else
    switch (Py_UNICODE_TOLOWER((Py_UNICODE)ch)) {
#endif
    case 'a': case 0xE0: case 0xE1: case 0xE2:
    case 'e': case 0xE8: case 0xE9: case 0xEA:
    case 'i': case 0xEC: case 0xED: case 0xEE:
    case 'o': case 0xF2: case 0xF3: case 0xF4:
    case 'u': case 0xF9: case 0xFA: case 0xFB:
        return TRUE;
    default:
        return FALSE;
    }
}
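The same membership test can be sketched in Python to see quickly which characters pass. This is only an approximation of the C function above: it uses str.lower() in place of Py_UNICODE_TOLOWER, and the name _VOWEL_CODE_POINTS is made up for this example.

```python
# Code points copied from the switch statement in the C function.
_VOWEL_CODE_POINTS = frozenset([
    ord("a"), 0xE0, 0xE1, 0xE2,
    ord("e"), 0xE8, 0xE9, 0xEA,
    ord("i"), 0xEC, 0xED, 0xEE,
    ord("o"), 0xF2, 0xF3, 0xF4,
    ord("u"), 0xF9, 0xFA, 0xFB,
])


def is_unicode_vowel(ch):
    """Approximate the C check: lowercase, then look up the code point."""
    lowered = ch.lower()
    # str.lower() can expand to several characters; treat those as non-vowels.
    return len(lowered) == 1 and ord(lowered) in _VOWEL_CODE_POINTS


# a, É, œ, y, ë
for ch in ["a", "\u00c9", "\u0153", "y", "\u00eb"]:
    print(repr(ch), is_unicode_vowel(ch))
```

Under this list, a and É count as vowels, while œ, y and ë do not, which is consistent with "l'œil" and "J'y" not being split in the question.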

This bug is now fixed in regex 2016.08.27, following a bug report. [_regex.c:#1668]
