
Remove all special characters and punctuation from a string and limit it to the first 200 characters

Hi, I need to remove all special characters, punctuation and spaces from a string so that only letters and numbers remain. The final string should contain only the first 200 of those characters.

I know of one solution:

string = "Special $#! character's   spaces 888323"

string = ''.join(e for e in string if e.isalnum())[:200]

But this first removes all the unwanted characters and only then slices the result. Is there something that works like a generator, i.e. stops as soon as 200 characters have been collected? I want a Pythonic solution. PS: I know I can achieve this with a for loop.

from itertools import islice
"".join(islice((e for e in string if e.isalnum()), 200))

Personally, though, I think a for loop reads a lot better.
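For comparison, here is a minimal sketch of the for-loop version mentioned above. The function name first_alnum and its limit parameter are made up for illustration; the point is only that the loop stops scanning once enough characters have been collected.

```python
def first_alnum(text, limit=200):
    """Return the first `limit` alphanumeric characters of `text`."""
    if limit <= 0:
        return ""
    kept = []
    for ch in text:
        if ch.isalnum():
            kept.append(ch)
            if len(kept) == limit:
                break  # stop scanning as soon as we have enough
    return "".join(kept)


print(first_alnum("Special $#! character's   spaces 888323", 10))  # Specialcha
```

Like the islice version, this never looks at the rest of the string after the limit is reached.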

Use a generator expression or function with itertools.islice:

from itertools import islice
s = "Special $#! character's   spaces 888323"
gen = (e for e in s if e.isalnum())
new_s = ''.join(islice(gen, 200))

Note that if the string is not huge and the number n (200 here) is not small compared to the string's length, you should use str.translate with simple slicing, as it is much faster than a Python-level for loop. The two-argument form str.translate(None, chars) used here is the Python 2 signature:

>>> from string import whitespace, punctuation
>>> s.translate(None, whitespace+punctuation)[:10]
'Specialcha'
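On Python 3, str.translate takes a single table argument, so the call above needs a table built with str.maketrans. A minimal sketch of the equivalent (the name DROP_TABLE is just for illustration):

```python
from string import punctuation, whitespace

# Build the deletion table once; the third argument to str.maketrans
# lists the characters that translate() should drop.
DROP_TABLE = str.maketrans("", "", whitespace + punctuation)

s = "Special $#! character's   spaces 888323"
result = s.translate(DROP_TABLE)[:10]
print(result)  # Specialcha
```

Keep in mind that this only strips ASCII whitespace and punctuation, so for text containing other symbols it is not an exact replacement for the isalnum() filter.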

Some timing comparisons for a large string:

>>> s = "Special $#! character's   spaces 888323" * 10000
>>> len(s)
390000
# For very small n
>>> %timeit ''.join(islice((e for e in s if e.isalnum()), 200))
10000 loops, best of 3: 20.2 µs per loop
>>> %timeit s.translate(None, whitespace+punctuation)[:200]
1000 loops, best of 3: 383 µs per loop

# For mid-sized n
>>> %timeit ''.join(islice((e for e in s if e.isalnum()), 10000))
1000 loops, best of 3: 930 µs per loop
>>> %timeit s.translate(None, whitespace+punctuation)[:10000]
1000 loops, best of 3: 378 µs per loop

# When n is comparable to length of string.
>>> %timeit ''.join(islice((e for e in s if e.isalnum()), 100000))
100 loops, best of 3: 9.41 ms per loop
>>> %timeit s.translate(None, whitespace+punctuation)[:100000]
1000 loops, best of 3: 385 µs per loop

If regular expressions aren't solving your problem, it could just be that you're not using enough of them yet :-) Here's a one-liner (discounting the import) that limits it to 20 characters (because your test data didn't match your specifications):

>>> import re
>>> string = "Special $#! character's   spaces 888323"
>>> re.sub("[^A-Za-z0-9]","",string)[:20]
'Specialcharactersspa'

While not technically a generator, it will work just as well provided you're not having to process truly massive strings.

What it will do is avoid the split and rejoin in your original solution:

''.join(e for e in something)

No doubt there's some cost to the regular expression processing but I'd have a hard time believing it's as high as building a temporary list then tearing it down into a string again. Still, if you're concerned, you should measure, not guess!
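As a rough way to do that measuring, here is a sketch using the standard timeit module. The helper names with_regex and with_islice and the repeat count are arbitrary choices, and the numbers will vary by machine and Python version, so no results are shown.

```python
import re
import timeit
from itertools import islice

s = "Special $#! character's   spaces 888323" * 10000
NON_ALNUM = re.compile(r"[^A-Za-z0-9]")


def with_regex(text, n):
    # Strip everything first, then slice.
    return NON_ALNUM.sub("", text)[:n]


def with_islice(text, n):
    # Stop pulling characters once n have been collected.
    return "".join(islice((c for c in text if c.isalnum()), n))


for n in (200, 10000, 100000):
    # Both approaches agree on this ASCII-only test string.
    assert with_regex(s, n) == with_islice(s, n)
    t_re = timeit.timeit(lambda: with_regex(s, n), number=5)
    t_is = timeit.timeit(lambda: with_islice(s, n), number=5)
    print("n=%d regex=%.4fs islice=%.4fs" % (n, t_re, t_is))
```

The two helpers only agree on ASCII input: isalnum() also accepts non-ASCII letters and digits, which the [^A-Za-z0-9] pattern removes.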


If you want an actual generator, it's easy enough to implement one:

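class alphanum(object):
    """Iterator yielding at most n alphanumeric characters from s."""

    def __init__(self, s, n):
        self.s = s    # source string
        self.n = n    # characters still allowed
        self.ix = 0   # current position in s

    def __iter__(self):
        return self

    def __next__(self):
        # Python 3 iterator protocol; delegates to the Python 2 name.
        return self.next()

    def next(self):
        if self.n <= 0:
            raise StopIteration()
        # Skip over any non-alphanumeric characters.
        while self.ix < len(self.s) and not self.s[self.ix].isalnum():
            self.ix += 1
        if self.ix == len(self.s):
            raise StopIteration()

        self.ix += 1
        self.n -= 1
        return self.s[self.ix-1]

    def remainder(self):
        # Join whatever characters have not been consumed yet.
        return ''.join([x for x in self])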

for x in alphanum("Special $#! chars", 10):
    print(x)

print(alphanum("Special $#! chars", 10).remainder())

which shows how you can use it as a 'character' iterator as well as a string modifier:

S
p
e
c
i
a
l
c
h
a
Specialcha
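If the class feels heavy, roughly the same behaviour can be had from a generator function. This is only a sketch; the name iter_alnum is made up, and it does not offer the remainder() method of the class above.

```python
def iter_alnum(text, limit):
    """Yield at most `limit` alphanumeric characters from `text`."""
    if limit <= 0:
        return
    remaining = limit
    for ch in text:
        if ch.isalnum():
            yield ch
            remaining -= 1
            if remaining == 0:
                return  # limit reached; stop scanning


print("".join(iter_alnum("Special $#! chars", 10)))  # Specialcha
```

It can be iterated character by character or passed straight to ''.join, just like the class.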
