I need to remove all special characters, punctuation and spaces from a string so that only letters and numbers remain. The final string should contain only the first 200 of those characters.
I know of one solution:
string = "Special $#! character's spaces 888323"
string = ''.join(e for e in string if e.isalnum())[:200]
But this first removes all the unwanted characters and only then slices the result. Is there something that works like a generator, i.e. that stops as soon as 200 characters have been collected? I want a Pythonic solution. PS: I know I can achieve this with a for loop.
from itertools import islice
"".join(islice((e for e in string if e.isalnum()), 200))
Personally, though, I think the for loop sounds a lot better.
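For comparison, the plain loop mentioned above might look like the following minimal sketch. The function name `first_alnum` and its `limit` parameter are made up for illustration and do not come from the thread.

```python
def first_alnum(text, limit=200):
    """Collect at most `limit` alphanumeric characters, stopping early."""
    kept = []
    for ch in text:
        if len(kept) >= limit:
            break  # enough characters collected, stop scanning the input
        if ch.isalnum():
            kept.append(ch)
    return ''.join(kept)


string = "Special $#! character's spaces 888323"
print(first_alnum(string, 10))  # Specialcha
```

The loop does the same work as the `islice` version; which one reads better is largely a matter of taste.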
Use a generator expression or function with itertools.islice:
from itertools import islice
s = "Special $#! character's spaces 888323"
gen = (e for e in s if e.isalnum())
new_s = ''.join(islice(gen, 200))
Note that if the string is not huge and the number n (200 here) is not small compared to the string's length, you should use str.translate with simple slicing, as it is going to be much faster than a Python-level for loop (the two-argument call shown here is Python 2 syntax):
>>> from string import whitespace, punctuation
>>> s.translate(None, whitespace+punctuation)[:10]
'Specialcha'
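The `s.translate(None, ...)` call above only works on Python 2 strings. On Python 3, a rough equivalent builds a deletion table with `str.maketrans` first. This is a sketch, and the variable name `delete_table` is an illustrative choice:

```python
from string import punctuation, whitespace

# Third argument of str.maketrans lists the characters to delete.
delete_table = str.maketrans('', '', whitespace + punctuation)

s = "Special $#! character's spaces 888323"
print(s.translate(delete_table)[:10])  # Specialcha
```

Unlike the `isalnum()` filter, this removes only ASCII whitespace and punctuation, so the two approaches can disagree on input that contains other non-alphanumeric characters.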
Some timing comparisons for a large string:
>>> s = "Special $#! character's spaces 888323" * 10000
>>> len(s)
390000
# For very small n
>>> %timeit ''.join(islice((e for e in s if e.isalnum()), 200))
10000 loops, best of 3: 20.2 µs per loop
>>> %timeit s.translate(None, whitespace+punctuation)[:200]
1000 loops, best of 3: 383 µs per loop
# For mid-sized n
>>> %timeit ''.join(islice((e for e in s if e.isalnum()), 10000))
1000 loops, best of 3: 930 µs per loop
>>> %timeit s.translate(None, whitespace+punctuation)[:10000]
1000 loops, best of 3: 378 µs per loop
# When n is comparable to length of string.
>>> %timeit ''.join(islice((e for e in s if e.isalnum()), 100000))
100 loops, best of 3: 9.41 ms per loop
>>> %timeit s.translate(None, whitespace+punctuation)[:100000]
1000 loops, best of 3: 385 µs per loop
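The `%timeit` lines above need IPython. A comparable measurement with the standard `timeit` module on Python 3 could look like this sketch; the helper names `lazy` and `eager` are assumptions made for the example, and the numbers will differ from machine to machine.

```python
import timeit
from itertools import islice
from string import punctuation, whitespace

s = "Special $#! character's spaces 888323" * 10000
table = str.maketrans('', '', whitespace + punctuation)


def lazy(n):
    # Stops scanning as soon as n alphanumeric characters were found.
    return ''.join(islice((e for e in s if e.isalnum()), n))


def eager(n):
    # Cleans the whole string in C, then slices.
    return s.translate(table)[:n]


for n in (200, 10000, 100000):
    assert lazy(n) == eager(n)  # both agree on this ASCII-only input
    t_lazy = min(timeit.repeat(lambda: lazy(n), number=10, repeat=3)) / 10
    t_eager = min(timeit.repeat(lambda: eager(n), number=10, repeat=3)) / 10
    print("n=%d: islice %.1f us, translate %.1f us"
          % (n, t_lazy * 1e6, t_eager * 1e6))
```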
If regular expressions aren't solving your problem, it could just be that you're not using enough of them yet :-) Here's a one-liner (discounting the import) that limits it to 20 characters (because your test data didn't match your specifications):
>>> import re
>>> string = "Special $#! character's spaces 888323"
>>> re.sub("[^A-Za-z0-9]","",string)[:20]
'Specialcharactersspa'
While not technically a generator, it will work just as well provided you're not having to process truly massive strings.
What it will do is avoid the split and rejoin in your original solution:
''.join(e for e in something)
No doubt there's some cost to the regular expression processing but I'd have a hard time believing it's as high as building a temporary list then tearing it down into a string again. Still, if you're concerned, you should measure, not guess!
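If stopping early matters, a regular expression can also be consumed lazily: `re.finditer` returns an iterator of matches, so combining it with `itertools.islice` stops scanning once enough characters have been found. A minimal sketch, using the same 20-character limit as above:

```python
import re
from itertools import islice

string = "Special $#! character's spaces 888323"

# re.finditer produces matches on demand, one alphanumeric character each.
matches = re.finditer(r"[A-Za-z0-9]", string)
result = ''.join(islice((m.group() for m in matches), 20))
print(result)  # Specialcharactersspa
```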
If you want an actual generator, it's easy enough to implement one:
class alphanum(object):
    def __init__(self, s, n):
        self.s = s      # source string
        self.n = n      # how many characters may still be returned
        self.ix = 0     # current position in the source string

    def __iter__(self):
        return self

    def __next__(self):
        # Python 3 iterator protocol; delegates to the Python 2 name.
        return self.next()

    def next(self):
        if self.n <= 0:
            raise StopIteration()
        # Skip over anything that is not a letter or digit.
        while self.ix < len(self.s) and not self.s[self.ix].isalnum():
            self.ix += 1
        if self.ix == len(self.s):
            raise StopIteration()
        self.ix += 1
        self.n -= 1
        return self.s[self.ix-1]

    def remainder(self):
        return ''.join([x for x in self])

for x in alphanum("Special $#! chars", 10):
    print(x)

print(alphanum("Special $#! chars", 10).remainder())
which shows how you can use it as a 'character' iterator as well as a string modifier:
S
p
e
c
i
a
l
c
h
a
Specialcha
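The same behaviour can probably be had with less code from a generator function. The sketch below is an alternative to the class above, not a rewrite of it, and the name `alphanum_gen` is made up for the example:

```python
def alphanum_gen(s, n):
    """Yield at most n alphanumeric characters from s."""
    if n <= 0:
        return
    for ch in s:
        if ch.isalnum():
            yield ch
            n -= 1
            if n == 0:
                return  # limit reached, stop without scanning further


print(''.join(alphanum_gen("Special $#! chars", 10)))  # Specialcha
```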