
Remove all special characters and punctuation from a string and limit it to the first 200 characters

Hi, I need to remove all special characters, punctuation and spaces from a string so that I only have letters and numbers. The final string should contain only the first 200 characters.

I know of one solution:

string = "Special $#! character's   spaces 888323"

string = ''.join(e for e in string if e.isalnum())[:200]

But this will first remove all the unwanted characters and only then slice the result. Is there something that works like a generator, i.e. stops as soon as 200 characters have been collected? I want a Pythonic solution. PS: I know I can achieve it with a for loop.

from itertools import islice
"".join(islice((e for e in string if e.isalnum()), 200))

But personally, I think the for loop sounds a lot better.
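For comparison, the for-loop version mentioned above might look like the following minimal sketch. The function name `first_alnum` and its `limit` parameter are hypothetical, chosen only for illustration.

```python
def first_alnum(text, limit=200):
    """Return up to `limit` alphanumeric characters from `text`."""
    kept = []
    for ch in text:
        if len(kept) >= limit:
            break  # stop scanning once enough characters are collected
        if ch.isalnum():
            kept.append(ch)
    return ''.join(kept)

string = "Special $#! character's   spaces 888323"
print(first_alnum(string, 10))  # Specialcha
```

Checking the limit at the top of the loop also covers `limit=0`, at the cost of looking at one extra character before stopping.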

Use a generator expression or function with itertools.islice:

from itertools import islice
s = "Special $#! character's   spaces 888323"
gen = (e for e in s if e.isalnum())
new_s = ''.join(islice(gen, 200))
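To check that this really stops early, one option is to count how many input characters get read. The sketch below wraps the string in a hypothetical helper, `counting`, that is not part of the original answer; it exists only to make the laziness visible.

```python
from itertools import islice

def counting(chars, counter):
    """Yield each character unchanged while recording how many were read."""
    for ch in chars:
        counter[0] += 1
        yield ch

s = "Special $#! character's   spaces 888323" * 10000
seen = [0]  # one-element list used as a mutable counter
gen = (e for e in counting(s, seen) if e.isalnum())
new_s = ''.join(islice(gen, 200))

print(len(new_s))  # 200
print(seen[0], "of", len(s), "characters examined")
```

Only a small prefix of the input is examined, because islice stops pulling from the generator once it has produced 200 items.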

Note that if the strings are not huge and n (200 here) is not small compared to the string length, you should use str.translate with simple slicing, as it is going to be very fast compared to a Python-level for loop (the two-argument form of str.translate shown here is Python 2 syntax):

>>> from string import whitespace, punctuation
>>> s.translate(None, whitespace+punctuation)[:10]
'Specialcha'
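On Python 3, str.translate takes a single table argument, so a rough equivalent builds the table with the three-argument form of str.maketrans. The variable name `delete_table` is just for illustration.

```python
from string import whitespace, punctuation

s = "Special $#! character's   spaces 888323"
# Three-argument str.maketrans: the third argument lists characters to delete.
delete_table = str.maketrans('', '', whitespace + punctuation)
print(s.translate(delete_table)[:10])  # Specialcha
```

Keep in mind that this deletes only ASCII whitespace and punctuation. Unlike the isalnum() filter, it leaves other non-alphanumeric characters (for example "£") in place.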

Some timing comparisons for a large string:

>>> s = "Special $#! character's   spaces 888323" * 10000
>>> len(s)
390000
# For very small n
>>> %timeit ''.join(islice((e for e in s if e.isalnum()), 200))
10000 loops, best of 3: 20.2 µs per loop
>>> %timeit s.translate(None, whitespace+punctuation)[:200]
1000 loops, best of 3: 383 µs per loop

# For mid-sized n
>>> %timeit ''.join(islice((e for e in s if e.isalnum()), 10000))
1000 loops, best of 3: 930 µs per loop
>>> %timeit s.translate(None, whitespace+punctuation)[:10000]
1000 loops, best of 3: 378 µs per loop

# When n is comparable to length of string.
>>> %timeit ''.join(islice((e for e in s if e.isalnum()), 100000))
100 loops, best of 3: 9.41 ms per loop
>>> %timeit s.translate(None, whitespace+punctuation)[:100000]
1000 loops, best of 3: 385 µs per loop

If regular expressions aren't solving your problem, it could just be that you're not using enough of them yet :-) Here's a one-liner (not counting the import) that limits it to 20 characters (because your test data didn't match your specifications):

>>> import re
>>> string = "Special $#! character's   spaces 888323"
>>> re.sub("[^A-Za-z0-9]","",string)[:20]
'Specialcharactersspa'

While not technically a generator, it will work just as well provided you're not having to process truly massive strings.

What it will do is avoid the split and rejoin in your original solution:

''.join(e for e in something)

No doubt there's some cost to the regular expression processing, but I'd have a hard time believing it's as high as building a temporary list and then tearing it down into a string again. Still, if you're concerned, you should measure, not guess!
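One way to measure is the standard timeit module. The sketch below is only a starting point: the helper names `with_join` and `with_regex`, the input size and the repeat count are arbitrary choices, and the numbers will vary by machine and Python version.

```python
import re
import timeit

string = "Special $#! character's   spaces 888323" * 1000

def with_join():
    return ''.join(e for e in string if e.isalnum())[:200]

def with_regex():
    return re.sub("[^A-Za-z0-9]", "", string)[:200]

# For this ASCII-only sample both approaches produce the same result.
assert with_join() == with_regex()

runs = 100
for func in (with_join, with_regex):
    seconds = timeit.timeit(func, number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e6:.1f} microseconds per call")
```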


If you want an actual generator, it's easy enough to implement one:

class alphanum(object):
    def __init__(self, s, n):
        self.s = s
        self.n = n
        self.ix = 0

    def __iter__(self):
        return self

    def __next__(self):
        return self.next()

    def next(self):
        if self.n <= 0:
            raise StopIteration()
        while self.ix < len(self.s) and not self.s[self.ix].isalnum():
            self.ix += 1
        if self.ix == len(self.s):
            raise StopIteration()

        self.ix += 1
        self.n -= 1
        return self.s[self.ix-1]

    def remainder(self):
        return ''.join([x for x in self])

for x in alphanum("Special $#! chars", 10):
    print(x)

print(alphanum("Special $#! chars", 10).remainder())

which shows how you can use it as a 'character' iterator as well as a string modifier:

S
p
e
c
i
a
l
c
h
a
Specialcha
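The same behaviour can also be sketched as a generator function using yield, which avoids the hand-written iterator protocol. The name `alphanum_gen` is hypothetical, and this is an alternative to the class above rather than a replacement for it.

```python
def alphanum_gen(s, n):
    """Yield at most n alphanumeric characters from s, lazily."""
    if n <= 0:
        return
    for ch in s:
        if ch.isalnum():
            yield ch
            n -= 1
            if n == 0:
                return  # limit reached: stop without scanning the rest

for x in alphanum_gen("Special $#! chars", 10):
    print(x)

print(''.join(alphanum_gen("Special $#! chars", 10)))  # Specialcha
```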
