
Substitution by regular expression in Python

Consider the Python snippet:

import re
text = 'that that kitty is cute'  # avoid shadowing the built-in str

# Anchor at beginning of string
rgexp_start = r'^(.*) \1'
print(re.sub(rgexp_start, r'\1', text))

# Do NOT anchor at beginning of string
rgexp = r'(.*) \1'
print(re.sub(rgexp, r'\1', text))

This prints:

that kitty is cute
thatkittyiscute

Why does the second regular expression remove all spaces?

As an additional question, consider the JavaScript snippet:

var str = 'that that kitty is cute';
var rgexp_start = /^(.*) \1/;
alert(str.replace(rgexp_start, '$1'));

var rgexp = /(.*) \1/;
alert(str.replace(rgexp, '$1'));

Which prints the following twice:

that kitty is cute

Why is it that JavaScript differs from Python in the handling of the very same regular expression?

To answer your first question, re.sub replaces every non-overlapping match of the pattern you pass.

So r'^(.*) \1' means: replace a duplicated phrase that starts at the beginning of the string. Since you have specified that the match must start at the beginning, and since a string has only one beginning, the only thing that can be matched and replaced is 'that that' at the start, and so it is done.

In[]: 'that that kitty is cute'

'^that that' -> 'that'

Out[]: 'that kitty is cute'
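
To see this for yourself, here is a minimal sketch that lists every match of the anchored pattern with re.finditer. The names text, anchored and matches are only illustrative and are not from the question.

```python
import re

text = 'that that kitty is cute'
anchored = re.compile(r'^(.*) \1')

# Collect (span, whole match, captured group) for every match found.
matches = [(m.span(), m.group(0), m.group(1)) for m in anchored.finditer(text)]
print(matches)                    # [((0, 9), 'that that', 'that')]

# Only that single match is replaced by its captured group.
print(anchored.sub(r'\1', text))  # that kitty is cute
```

Because ^ can only succeed at position 0 (without re.MULTILINE), no further match is possible once the first one has been consumed.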

In the case of r'(.*) \1', note that .* can match zero or more characters. This is important, since now the regex isn't bound to the start anymore. So, in addition to matching 'that that' (which the first regex also did), it matches '' (the empty string), then a space, then '' again, and it does this 3 times, once at each remaining space. Each of those matches is ' ' (a space with an empty string on either side), and it is replaced with the captured group, which is ''.

In[]: 'that that kitty is cute'

'that that' -> 'that'
' '         -> ''
' '         -> ''
' '         -> ''

Out[]: 'thatkittyiscute'
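
The four matches can be checked directly. The following is a small sketch using re.finditer; the variable names are illustrative assumptions, not part of the original code.

```python
import re

text = 'that that kitty is cute'
pattern = re.compile(r'(.*) \1')

# Record where each match starts and ends, and what the group captured.
spans = []
for m in pattern.finditer(text):
    spans.append(m.span())
    print(m.span(), repr(m.group(0)), repr(m.group(1)))

# Printed lines:
# (0, 9) 'that that' 'that'
# (9, 10) ' ' ''
# (15, 16) ' ' ''
# (18, 19) ' ' ''

result = pattern.sub(r'\1', text)
print(result)  # thatkittyiscute
```

The last three matches each capture an empty group, so replacing them with \1 simply deletes the space.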

To answer your second question, the difference between Python and JavaScript, as explained by anubhava, is that the global flag in JavaScript is not enabled by default; only the first replacement occurs, leaving the rest of the string untouched.

JavaScript behaves differently because you have not turned on the global (g) flag in the JavaScript regex. Python's re.sub replaces all matches by default, so it acts as if that flag were on.

If you use the same regex with the g flag:

var rgexp = /(.*) \1/g;
console.log(str.replace(rgexp, '$1'));

Then it will print:

thatkittyiscute

This is the same behavior as Python.
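
Going the other way, Python can be made to act like JavaScript without the g flag by limiting the number of replacements. This is a sketch using the count argument of re.sub; the variable names are illustrative.

```python
import re

text = 'that that kitty is cute'
pattern = r'(.*) \1'

# Default (count=0): every match is replaced, like JavaScript's /g flag.
all_replaced = re.sub(pattern, r'\1', text)

# count=1: stop after the first match, like JavaScript without /g.
first_only = re.sub(pattern, r'\1', text, count=1)

print(all_replaced)  # thatkittyiscute
print(first_only)    # that kitty is cute
```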

By the way, if you use this slightly different regex:

(\S+) \1

Then it prints the following after replacement, even without the anchor used in your first example:

that kitty is cute

\S+ matches one or more non-whitespace characters, so the captured group can never be empty and a lone space no longer matches.
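
A quick way to confirm this in Python is to run the substitution with and without the anchor. This is only a sketch, and the variable names are illustrative.

```python
import re

text = 'that that kitty is cute'

# With \S+ the result is the same whether or not the pattern is anchored.
unanchored = re.sub(r'(\S+) \1', r'\1', text)
anchored = re.sub(r'^(\S+) \1', r'\1', text)
print(unanchored)  # that kitty is cute
print(anchored)    # that kitty is cute

# findall returns the captured group of each match: only one match exists.
groups = re.findall(r'(\S+) \1', text)
print(groups)      # ['that']
```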
