I looked into how tokenization is implemented in scikit-learn and found this regex (source):
token_pattern = r"(?u)\b\w\w+\b"
The regex is pretty straightforward, but I have never seen the (?u) part before. Can someone explain to me what this part is doing?
It switches on the re.U (re.UNICODE) flag for this expression.
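To see what Unicode-aware matching changes for this token pattern, here is a small sketch. The sample text and variable names are made up for illustration, and the (?a) ASCII flag is used only as a contrast (it is not part of the scikit-learn pattern). Note that in Python 3 str patterns are already Unicode-aware by default, so (?u) mainly makes a difference on Python 2.

```python
import re

# Hypothetical sample text: "naïve café x résumé", written with escapes
# so the accented letters are single precomposed code points.
text = "na\u00efve caf\u00e9 x r\u00e9sum\u00e9"

# Unicode-aware matching (what (?u) requests): accented letters count as \w.
unicode_tokens = re.findall(r"(?u)\b\w\w+\b", text)

# ASCII-only matching, for contrast: accented letters are not \w,
# so they act as word boundaries and split the tokens.
ascii_tokens = re.findall(r"(?a)\b\w\w+\b", text)

print(unicode_tokens)
print(ascii_tokens)
```

In both cases the single-character token "x" is dropped, because the pattern requires at least two word characters.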
From the module documentation:
(?iLmsux)
(One or more letters from the set 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode dependent), and re.X (verbose), for the entire regular expression. (The flags are described in Module Contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function.
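As a quick check of the last point in the quote, the following sketch compiles the same pattern twice, once with the inline (?u) group and once with the flag passed to re.compile(), and compares the results. The sample string and variable names are illustrative assumptions, not taken from scikit-learn.

```python
import re

# Same pattern, flag given inline versus as an argument to re.compile().
inline = re.compile(r"(?u)\b\w\w+\b")
explicit = re.compile(r"\b\w\w+\b", re.UNICODE)

# Hypothetical sample: "hello wörld a 42"
sample = "hello w\u00f6rld a 42"

inline_tokens = inline.findall(sample)
explicit_tokens = explicit.findall(sample)

# Both compiled patterns report the UNICODE flag and tokenize identically.
print(inline_tokens == explicit_tokens)
print(bool(inline.flags & re.UNICODE), bool(explicit.flags & re.UNICODE))
```

The inline form is convenient when the pattern is handed around as a plain string, as with a token_pattern parameter, where there is no separate place to pass flags.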