
Substitution by regular expression in Python

Consider the Python snippet:

import re

# Named `text` rather than `str` to avoid shadowing the built-in str type
text = 'that that kitty is cute'

# Anchor at beginning of string
rgexp_start = r'^(.*) \1'
print(re.sub(rgexp_start, r'\1', text))

# Do NOT anchor at beginning of string
rgexp = r'(.*) \1'
print(re.sub(rgexp, r'\1', text))

This prints:

that kitty is cute
thatkittyiscute

Why does the second regular expression remove all spaces? As an additional question, consider the JavaScript snippet:

var str = 'that that kitty is cute';
var rgexp_start = /^(.*) \1/;
alert(str.replace(rgexp_start, '$1'));

var rgexp = /(.*) \1/;
alert(str.replace(rgexp, '$1'));

Both calls display:

that kitty is cute

Why does JavaScript handle the very same regular expression differently from Python?

To answer your first question: re.sub replaces every non-overlapping match of the pattern you pass, scanning the string from left to right.

So r'^(.*) \1' means: find a run of text that is repeated after a single space, starting at the beginning of the string. Because ^ ties the match to the start, and a string has only one start, the only thing that can be matched and replaced is the leading 'that that', and that is exactly what happens.

In[]: 'that that kitty is cute'

'^that that' -> 'that'

Out[]: 'that kitty is cute'
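
To confirm that the anchored pattern matches only once, a minimal sketch using re.subn, which returns both the new string and the number of substitutions made. The variable names here are illustrative and not from the question.

```python
import re

text = 'that that kitty is cute'

# re.subn returns a tuple: (new_string, number_of_substitutions_made)
anchored_result, anchored_count = re.subn(r'^(.*) \1', r'\1', text)

# Without re.MULTILINE, ^ can only match at position 0,
# so at most one substitution is possible.
print(anchored_result)
print(anchored_count)
```
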

In the case of r'(.*) \1', keep in mind that .* can match zero or more characters. This matters because the regex is no longer bound to the start of the string. In addition to the leading 'that that' (which the first regex also matched), it can match at each remaining space with an empty group: '' , then the space, then '' again, since a backreference to an empty group matches the empty string. This happens 3 times, so each of those spaces (a ' ' with an empty string on either side) is replaced with ''.

In[]: 'that that kitty is cute'

'that that' -> 'that'
' '         -> ''
' '         -> ''
' '         -> ''

Out[]: 'thatkittyiscute'
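
The four matches listed above can be inspected directly with re.finditer. This is a small sketch for illustration; the tuple layout and variable names are my own choice.

```python
import re

text = 'that that kitty is cute'

# Collect (start, end, captured group) for every non-overlapping match
matches = [(m.start(), m.end(), m.group(1))
           for m in re.finditer(r'(.*) \1', text)]

for start, end, group in matches:
    # The last three matches capture an empty group and consume one space each
    print(f'{start:2}-{end:2}  matched {text[start:end]!r}  group 1 = {group!r}')
```

The first match spans 'that that' with 'that' in the group; the remaining three each span a single space with an empty group, which is why re.sub deletes those spaces.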

To answer your second question, the difference between Python and JavaScript, as explained by anubhava, is that the global flag is not enabled by default in JavaScript: only the first replacement occurs, leaving the rest of the string untouched.

JavaScript behaves differently because you have not turned on the global (g) flag on the JavaScript regex. Python has no such flag: re.sub replaces all matches by default, since its count argument defaults to 0, meaning no limit.

If you use the same regex with the g flag:

var rgexp = /(.*) \1/g;
console.log(str.replace(rgexp, '$1'));

Then it will print:

thatkittyiscute

This is the same behavior as Python.
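
Going the other way, the non-global JavaScript behavior can be reproduced in Python by limiting re.sub to a single replacement with its count argument. A minimal sketch, with illustrative variable names:

```python
import re

text = 'that that kitty is cute'
pattern = r'(.*) \1'

# count=1 stops after the first match, like a JavaScript regex without /g
first_only = re.sub(pattern, r'\1', text, count=1)

# The default count=0 replaces every match, like a JavaScript regex with /g
all_matches = re.sub(pattern, r'\1', text)

print(first_only)
print(all_matches)
```
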

By the way, if you use this slightly different regex:

(\S+) \1

Then the replacement always produces the following, even without the anchor used in your first example:

that kitty is cute

\S+ matches one or more non-whitespace characters, so the group can never be empty and a lone space is never matched on its own.
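
A short Python check of the (\S+) \1 pattern follows. The second half is my own addition rather than part of the answers: it illustrates that (\S+) \1 can still match inside words, and sketches one possible tightening with word boundaries (\b and \w+), which is an assumption about what "duplicate word" should mean.

```python
import re

text = 'that that kitty is cute'

# The group needs at least one non-space character, so single spaces survive
nonspace_result = re.sub(r'(\S+) \1', r'\1', text)
print(nonspace_result)

# Caveat: the group may start mid-word, so the 'is' ending 'this'
# pairs with the following word 'is'
partial = re.sub(r'(\S+) \1', r'\1', 'this is fine')
print(partial)

# Hypothetical stricter variant: require whole words on both sides
bounded = re.sub(r'\b(\w+) \1\b', r'\1', 'this is fine')
bounded_dup = re.sub(r'\b(\w+) \1\b', r'\1', text)
print(bounded)
print(bounded_dup)
```
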


 