Replacing whole string is faster than replacing only its first character

Question

I wanted to replace the character a with b in a large string. As an experiment, I first replaced it throughout the whole string, then only at the beginning of the string.

import re
# pattern = re.compile('a')
pattern = re.compile('^a')
string = 'x' * 100000

pattern.sub('b', string)

I expected that replacing only at the beginning would be much faster than replacing throughout the whole string, because you have to check only 1 position instead of 100000. I did some measuring:

python -m timeit --setup "import re; p=re.compile('a'); string='x'*100000" "p.sub('b', string)"
10000 loops, best of 3: 19.1 usec per loop

python -m timeit --setup "import re; p=re.compile('^a'); string='x'*100000" "p.sub('b', string)"
1000 loops, best of 3: 613 usec per loop

The results show that, on the contrary, trying to replace throughout the whole string is about 30x faster. Would you expect such a result? Can you explain it?

Answer 1

The functions provided in the Python re module do not optimize based on the pattern. In particular, functions that try to apply a regex at every position (.search, .sub, .findall, etc.) will do so even when the regex can only possibly match at the beginning. That is, even when multi-line mode is not specified, so that ^ can only match at the beginning of the string, the call is not re-routed internally. Thus:

$ # .match only looks at the first position regardless
$ python -m timeit --setup "import re; p=re.compile('a'); string='x'*100000" "p.match(string)"
2000000 loops, best of 5: 155 nsec per loop
$ python -m timeit --setup "import re; p=re.compile('^a'); string='x'*100000" "p.match(string)"
2000000 loops, best of 5: 157 nsec per loop
$ # .search looks at every position, even if there is an anchor
$ python -m timeit --setup "import re; p=re.compile('a'); string='x'*100000" "p.search(string)"
10000 loops, best of 5: 22.4 usec per loop
$ # and the anchor only adds complexity to the matching process
$ python -m timeit --setup "import re; p=re.compile('^a'); string='x'*100000" "p.search(string)"
500 loops, best of 5: 746 usec per loop

In short, your code with .sub must look at every position because that is what .sub is defined to do, even though it's obviously silly here.

I couldn't tell you why the anchor makes the search that much slower, nor why the worst-case slowdown is still much less than implied by the string length. Regardless, the practical advice is to use a .match-based variant where performance matters.

I think a reasonable case can be made for filing a bug report about this: not having such an obvious optimization implemented clearly violates expectations. Besides, while it's easy to replace an anchored .search with .match, it's not so straightforward to replace an anchored .sub: you have to .match, check the result, and then call .replace on the string yourself.
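As a rough sketch of that .match-based workaround: the helper name sub_at_start below is made up for illustration, and it uses slicing rather than str.replace to build the result. It assumes the pattern is written without the leading ^ and that multi-line mode is not in play, so it should only be read as a stand-in for an anchored .sub in that simple case.

```python
import re

def sub_at_start(pattern, repl, string):
    """Hypothetical helper: replace a match of `pattern` only at position 0."""
    m = pattern.match(string)  # .match only tries the first position
    if m is None:
        return string  # no match at the start, nothing to replace
    # m.expand() resolves backreferences such as \1 in the replacement template
    return m.expand(repl) + string[m.end():]

pattern = re.compile('a')  # note: no ^ needed, .match is already anchored
print(sub_at_start(pattern, 'b', 'abc'))
print(sub_at_start(pattern, 'b', 'x' * 10))
```

Because only the first position is examined, the cost should no longer depend on scanning the rest of the string, apart from copying the unchanged tail when a match is found.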

If you need to anchor to the end of the string and not the start, it gets much more difficult; I recall ancient Perl advice to try reversing the string first, but it's hard in general to write a pattern that matches the reverse of what you want.
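For the narrow special case where the thing to replace at the end is a fixed literal rather than a real pattern, no regex is needed at all. The sketch below (replace_suffix is a hypothetical name) relies on str.endswith and roughly corresponds to a pattern like a\Z; it does not help with genuine end-anchored regexes, which remain as awkward as described above.

```python
def replace_suffix(string, old, new):
    """Hypothetical helper: replace the literal `old` only if it ends `string`."""
    # Guard against an empty `old`, which every string "ends with"
    if old and string.endswith(old):
        return string[:len(string) - len(old)] + new
    return string

print(replace_suffix('x' * 10 + 'a', 'a', 'b'))
print(replace_suffix('a' + 'x' * 10, 'a', 'b'))
```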
