
What does “(?u)” do in a regex?

I looked into how tokenization is implemented in scikit-learn and found this regex (source):

token_pattern = r"(?u)\b\w\w+\b"

The regex is pretty straightforward, but I have never seen the (?u) part before. Can someone explain what this part does?

It switches on the re.U (re.UNICODE) flag for this expression.

From the module documentation:

(?iLmsux)

(One or more letters from the set 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode dependent), and re.X (verbose), for the entire regular expression. (The flags are described in Module Contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function.
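
To illustrate, here is a minimal sketch comparing the inline (?u) flag with passing re.UNICODE to re.compile(). The sample sentence and variable names are my own, not taken from scikit-learn. It assumes Python 3, where str patterns are already Unicode-aware by default, so (?u) changes nothing there and mainly matters on Python 2. The (?a) variant is included only to show what \w does without Unicode matching.

```python
import re

# Inline flag, as in scikit-learn's token_pattern.
inline = re.compile(r"(?u)\b\w\w+\b")

# The same pattern with the flag passed as an argument instead.
explicit = re.compile(r"\b\w\w+\b", re.UNICODE)

# ASCII-only matching for contrast: \w no longer matches accented letters.
ascii_only = re.compile(r"(?a)\b\w\w+\b")

text = "naïve café is a déjà vu"  # illustrative sample text

tokens_inline = inline.findall(text)
tokens_explicit = explicit.findall(text)
tokens_ascii = ascii_only.findall(text)

print(tokens_inline)    # ['naïve', 'café', 'is', 'déjà', 'vu']
print(tokens_explicit)  # same list as above
print(tokens_ascii)     # ['na', 've', 'caf', 'is', 'vu']
```

Both Unicode variants yield the same tokens, and the single-letter word "a" is dropped because the pattern requires at least two word characters. With ASCII-only matching, accented letters count as non-word characters, so words get split at them.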
