
Python fastest way to remove multiple spaces in a string

This question has been asked before, but the fast answers that I have seen also strip the leading and trailing spaces, which I don't want.

"   a     bc    "

should become

" a bc "

I have

text = re.sub(' +', " ", text)

but am hoping for something faster. The suggestion that I have seen (and which won't work) is

' '.join(text.split())

Note that I will be doing this to lots of smaller texts so just checking for a trailing space won't be so great.
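
To make the difference concrete, here is a small sketch comparing the two expressions on the sample string. The helper names collapse_regex and collapse_split are made up for illustration.

```python
import re

def collapse_regex(text):
    # Replace every run of one or more spaces with a single space.
    return re.sub(' +', ' ', text)

def collapse_split(text):
    # split() with no arguments discards leading and trailing whitespace,
    # so the outer spaces are lost.
    return ' '.join(text.split())

sample = "   a     bc    "
print(repr(collapse_regex(sample)))  # ' a bc '
print(repr(collapse_split(sample)))  # 'a bc'
```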

If you really want to optimize something like this, use C, not Python.

Try Cython: its syntax is pretty much Python's, but it compiles to C and can run close to C speed.

Here is some stuff you can time:

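# Python 2 only: the 'c' typecode does not exist in Python 3
import array
buf=array.array('c')
input="   a     bc    "
space=False  # True when the previous character was a space
for c in input:
  # keep c unless it is a space that directly follows another space
  if not space or c != ' ': buf.append(c)
  space = (c == ' ')
buf.tostring()  # " a bc "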

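Also try using cStringIO:

# Python 2 only: in Python 3 the closest equivalent is io.StringIO
import cStringIO
buf=cStringIO.StringIO()
input="   a     bc    "
space=False  # True when the previous character was a space
for c in input:
  # keep c unless it is a space that directly follows another space
  if not space or c != ' ': buf.write(c)
  space = (c == ' ')
buf.getvalue()  # " a bc "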

But again, if you want to make such things really fast, don't do it in Python. Use Cython. The two approaches I gave here will likely be slower than re.sub, simply because they put much more work on the Python interpreter. If you want these things to be fast, do as little as possible in Python. The for c in input loop alone likely kills any theoretical performance advantage of the approaches above.
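
The snippets above are Python 2. A rough Python 3 sketch of the same character loop, assuming io.StringIO as the replacement buffer (the function name collapse_loop is hypothetical), might look like this:

```python
import io

def collapse_loop(text):
    buf = io.StringIO()
    prev_space = False  # True when the previous character was a space
    for c in text:
        # Skip a space only when the previous character was also a space.
        if not prev_space or c != ' ':
            buf.write(c)
        prev_space = (c == ' ')
    return buf.getvalue()

print(repr(collapse_loop("   a     bc    ")))  # ' a bc '
```

Every character still passes through the interpreter loop, so this is mainly useful as a baseline to time against the other approaches.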

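FWIW, some timings (note that the third run uses the already-collapsed string " a bc " as its input, so it is not directly comparable to the others):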

$ python -m timeit -s 's="   a     bc    "' 't=s[:]' "while '  ' in t: t=t.replace('  ', ' ')"
1000000 loops, best of 3: 1.05 usec per loop

$ python -m timeit -s 'import re;s="   a     bc    "'  "re.sub(' +', ' ', s)"
100000 loops, best of 3: 2.27 usec per loop

$ python -m timeit -s 's=" a bc "' "''.join((s[0],' '.join(s[1:-1].split()),s[-1]))"
1000000 loops, best of 3: 0.592 usec per loop

$ python -m timeit -s 'import re;s="   a     bc    "'  "re.sub(' {2,}', ' ', s)"
100000 loops, best of 3: 2.34 usec per loop

$ python -m timeit -s 's="   a     bc    "' '" "+" ".join(s.split())+" "'
1000000 loops, best of 3: 0.387 usec per loop
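
To reproduce this kind of comparison from a script rather than the shell, something like the following sketch could be used. The function names and the loop count are arbitrary choices for illustration, and the numbers will differ by machine and Python version.

```python
import re
import timeit

SAMPLE = "   a     bc    "
NUMBER = 10000  # arbitrary loop count, kept small so the script runs quickly

def with_replace(s):
    # Shrink runs of spaces until no double space is left.
    while '  ' in s:
        s = s.replace('  ', ' ')
    return s

def with_regex(s):
    return re.sub(' +', ' ', s)

def with_split(s):
    # Assumes the input always has leading and trailing spaces.
    return ' ' + ' '.join(s.split()) + ' '

candidates = [with_replace, with_regex, with_split]

for func in candidates:
    # timeit.timeit accepts a zero-argument callable.
    seconds = timeit.timeit(lambda: func(SAMPLE), number=NUMBER)
    print(f"{func.__name__}: {seconds / NUMBER * 1e6:.3f} usec per loop")
```

Calling through a lambda adds a little overhead to every candidate, so treat the output as a relative comparison only.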

This is just a small rewrite of the suggestion in the question; a small fault in an approach doesn't mean you should assume it won't work.

You could easily do something like:

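# startswith/endswith avoid an IndexError when the string is empty
front_space = lambda x: x.startswith(" ")
trailing_space = lambda x: x.endswith(" ")
# multiplying a string by a bool yields it once (True) or not at all (False)
" "*front_space(text)+' '.join(text.split())+" "*trailing_space(text)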

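A slightly more defensive variant of the same idea, as a sketch: it returns a single space for an all-space input, which is what re.sub(' +', ' ', text) gives, and leaves an empty string empty. The function name collapse_keep_edges is made up for illustration. Note that split() also treats tabs and newlines as separators, so this only matches the regex when spaces are the only whitespace in the text.

```python
def collapse_keep_edges(text):
    # Assumption: spaces are the only whitespace that occurs in text.
    words = text.split()
    if not words:
        # Empty stays empty; a string of only spaces becomes one space.
        return ' ' if text else ''
    lead = ' ' if text.startswith(' ') else ''
    trail = ' ' if text.endswith(' ') else ''
    return lead + ' '.join(words) + trail

print(repr(collapse_keep_edges("   a     bc    ")))  # ' a bc '
```
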