I have quite large amount of text which include control charachters like \\n \\t and \\r. I need to replace them with a simple space--> " ". What is the fastest way to do this? Thanks
I think the fastest way is to use str.translate()
:
import string
s = "a\nb\rc\td"
print s.translate(string.maketrans("\n\t\r", " "))
prints
a b c d
EDIT : As this once again turned into a discussion about performance, here some numbers. For long strings, translate()
is way faster than using regular expressions:
s = "a\nb\rc\td " * 1250000
regex = re.compile(r'[\n\r\t]')
%timeit t = regex.sub(" ", s)
# 1 loops, best of 3: 1.19 s per loop
table = string.maketrans("\n\t\r", " ")
%timeit s.translate(table)
# 10 loops, best of 3: 29.3 ms per loop
That's about a factor 40.
You may also try regular expressions:
import re
regex = re.compile(r'[\n\r\t]')
regex.sub(' ', my_str)
>>> re.sub(r'[\t\n\r]', ' ', '1\n2\r3\t4')
'1 2 3 4'
If you want to normalise whitespace (replace runs of one or more whitespace characters by a single space, and strip leading and trailing whitespace) this can be accomplished by using string methods:
>>> text = ' foo\tbar\r\nFred Nurke\t Joe Smith\n\n'
>>> ' '.join(text.split())
'foo bar Fred Nurke Joe Smith'
using regex
re.sub(r'\s+', ' ', '1\n2\r3\t4')
without regex
>>> ' '.join('1\n\n2\r3\t4'.split())
'1 2 3 4'
>>>
my_string
is the string where you want to delete specific control characters. As strings are immutable in python, after substitute operation you need to assign it to another string or reassign it:
my_string = re.sub(r'[\n\r\t]*', '', my_string)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.