What does “(?u)” do in a regex?

Question

I looked into how tokenization is implemented in scikit-learn and found this regex (source):

token_pattern = r"(?u)\b\w\w+\b"

The regex is pretty straightforward, but I have never seen the (?u) part before. Can someone explain what this part does?

Answer 1

It switches on the re.U (re.UNICODE) flag for this expression.

From the module documentation:

(?iLmsux)

(One or more letters from the set 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode dependent), and re.X (verbose), for the entire regular expression. (The flags are described in Module Contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function.
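
To make the effect concrete, here is a minimal sketch, assuming Python 3 and a made-up sample string (the variable names are illustrative only). In Python 3, str patterns already match Unicode by default, so (?u) is redundant there and kept only for backward compatibility; the passage quoted above describes the older behavior, where the flag actually changes what \w and \b match. The re.ASCII variant is included only to show what the pattern would do without Unicode-aware matching.

```python
import re

# Hypothetical sample text: "naïve café ok a", written with escapes so the
# accented letters are single code points.
text = "na\u00efve caf\u00e9 ok a"

# Pattern from the question: inline (?u) flag, then tokens of 2+ word characters.
inline = re.compile(r"(?u)\b\w\w+\b")

# Same pattern, with the flag passed as an argument instead of inline.
by_argument = re.compile(r"\b\w\w+\b", re.UNICODE)

# For contrast: re.ASCII makes \w and \b consider ASCII characters only.
ascii_only = re.compile(r"\b\w\w+\b", re.ASCII)

tokens_inline = inline.findall(text)
tokens_argument = by_argument.findall(text)
tokens_ascii = ascii_only.findall(text)

print(tokens_inline)    # ['naïve', 'café', 'ok']
print(tokens_argument)  # ['naïve', 'café', 'ok']
print(tokens_ascii)     # ['na', 've', 'caf', 'ok']
```

The single-letter token "a" is dropped in every case because the pattern requires at least two word characters. With ASCII-only matching, the accented letters count as non-word characters, so the words are split or truncated around them.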

Answer 1: 18, accepted, 2016-01-27 16:41:18
