Matching Unicode word boundaries in Python

In order to match Unicode word boundaries [as defined in Annex #29] in Python, I have been using the regex package with the flags regex.WORD | regex.V1 (regex.UNICODE should be the default since the pattern is a Unicode string) in the following way:

>>> import regex
>>> s = "here are some words"
>>> regex.findall(r'\w(?:\B\S)*', s, flags = regex.V1 | regex.WORD)
['here', 'are', 'some', 'words']

It works well in this rather simple case. However, I was wondering what the expected behavior is when the input string contains certain punctuation. It seems to me that WB7 says that, for example, the apostrophe in x'z does not qualify as a word boundary, which indeed seems to be the case:

>>> regex.findall(r'\w(?:\B\S)*', "x'z", flags = regex.V1 | regex.WORD)
["x'z"]

However, if there is a vowel, the situation changes:

>>> regex.findall(r'\w(?:\B\S)*', "l'avion", flags = regex.V1 | regex.WORD)
["l'", 'avion']

This would suggest that the regex module implements rule WB5a, which is mentioned in the Notes section of the standard. However, this rule also says that the behavior should be the same with ’ (right single quotation mark), which I can't reproduce:

>>> regex.findall(r'\w(?:\B\S)*', "l\u2019avion", flags = regex.V1 | regex.WORD)
['l’avion']

Moreover, even with the "normal" apostrophe, a ligature (or y) seems to behave as a "non-vowel":

>>> regex.findall(r'\w(?:\B\S)*', "l'œil", flags = regex.V1 | regex.WORD)
["l'œil"]
>>> regex.findall(r'\w(?:\B\S)*', "J'y suis", flags = regex.V1 | regex.WORD)
["J'y", 'suis']

Is this the expected behavior? (All examples above were executed with regex 2.4.106 and Python 3.5.2.)

1- RIGHT SINGLE QUOTATION MARK ’ (U+2019) seems to have simply been missed in the source file:

/* Break between apostrophe and vowels (French, Italian). */
/* WB5a */
if (pos_m1 >= 0 && char_at(state->text, pos_m1) == '\'' &&
  is_unicode_vowel(char_at(state->text, text_pos)))
    return TRUE;
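
To make the effect of that comparison concrete, here is a minimal Python sketch of the same condition. It is not the regex module's code: wb5a_break, is_vowel and the apostrophes parameter are hypothetical names, and is_vowel is a deliberately simplified stand-in that only knows plain a, e, i, o, u.

```python
APOSTROPHE = "'"               # U+0027, the only character the C check compares against
RIGHT_SINGLE_QUOTE = "\u2019"  # U+2019, not tested by the C fragment above

def is_vowel(ch):
    # Simplified stand-in for is_unicode_vowel(): unaccented vowels only.
    return ch.lower() in "aeiou"

def wb5a_break(text, pos, apostrophes=(APOSTROPHE,)):
    """Return True if a WB5a-style break would be reported before text[pos].

    Mirrors the shape of the C condition: the previous character must be an
    accepted apostrophe and the current character must be a vowel.
    """
    if not 0 < pos < len(text):
        return False
    return text[pos - 1] in apostrophes and is_vowel(text[pos])

print(wb5a_break("l'avion", 2))       # break after the ASCII apostrophe
print(wb5a_break("l\u2019avion", 2))  # no break: U+2019 is not compared
# Hypothetical variant that also accepts the right single quotation mark.
print(wb5a_break("l\u2019avion", 2, apostrophes=(APOSTROPHE, RIGHT_SINGLE_QUOTE)))
```

With the default argument only the ASCII apostrophe triggers the break, which matches the difference between the l'avion and l’avion results in the question.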

2- Unicode vowels are determined by the is_unicode_vowel() function, which translates to this list:

a, à, á, â, e, è, é, ê, i, ì, í, î, o, ò, ó, ô, u, ù, ú, û

So the LATIN SMALL LIGATURE OE character œ is not considered a Unicode vowel:

Py_LOCAL_INLINE(BOOL) is_unicode_vowel(Py_UCS4 ch) {
#if PY_VERSION_HEX >= 0x03030000
    switch (Py_UNICODE_TOLOWER(ch)) {
#else
    switch (Py_UNICODE_TOLOWER((Py_UNICODE)ch)) {
#endif
    case 'a': case 0xE0: case 0xE1: case 0xE2:
    case 'e': case 0xE8: case 0xE9: case 0xEA:
    case 'i': case 0xEC: case 0xED: case 0xEE:
    case 'o': case 0xF2: case 0xF3: case 0xF4:
    case 'u': case 0xF9: case 0xFA: case 0xFB:
        return TRUE;
    default:
        return FALSE;
    }
}
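
For readers who would rather experiment in Python, the switch statement can be approximated as follows. This is only a sketch: str.lower() stands in for Py_UNICODE_TOLOWER (the two can differ for a few characters with special case mappings), and VOWEL_CODE_POINTS is a name chosen here for illustration.

```python
# Code points listed in the C switch statement (compared after lowercasing).
VOWEL_CODE_POINTS = frozenset([
    ord("a"), 0xE0, 0xE1, 0xE2,   # a à á â
    ord("e"), 0xE8, 0xE9, 0xEA,   # e è é ê
    ord("i"), 0xEC, 0xED, 0xEE,   # i ì í î
    ord("o"), 0xF2, 0xF3, 0xF4,   # o ò ó ô
    ord("u"), 0xF9, 0xFA, 0xFB,   # u ù ú û
])

def is_unicode_vowel(ch):
    """Approximate Python mirror of the C helper for a single character."""
    lowered = ch.lower()
    # A multi-character lowercase form can never match a single code point.
    return len(lowered) == 1 and ord(lowered) in VOWEL_CODE_POINTS

for ch in ["a", "É", "œ", "y", "ä"]:
    print(repr(ch), is_unicode_vowel(ch))
```

Under this list œ, y and ä are all rejected, which is consistent with l'œil and J'y not being split in the question.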

This bug is now fixed in regex 2016.08.27 after a bug report. [_regex.c:#1668]
