Remove symbols from string but keep whitespaces

I need to remove special characters from a string but I also need to keep whitespaces. This is my code so far:

from unidecode import unidecode
import re

def cleanstr(string):
    if isinstance(string, str):
        string = string.decode('utf-8')
    string = unidecode(string)
    string = re.sub('[^A-Za-z0-9]+', '', string)
    return string

print cleanstr("She's my friend Adélaïde")
>> ShesmyfriendAdelaide

The expected result is: Shes my friend Adelaide

Without regular expressions

import string

sentence = "vg583$%#jgv f_vrefg fh4ufrh4 %# dhejrfh #"

print "".join([s for s in sentence if s in string.ascii_letters + string.digits + ' '])

Output

'vg583jgv fvrefg fh4ufrh4  dhejrfh'

Admittedly, this does not handle Unicode as written, so you may need to tweak it a bit.

I think your final solution (in case you do want to deal with unicode) should look like this:

u''.join([transform_char(c) for c in your_unicode_string if condition_met(c)])
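One way to fill in those two placeholders, as a rough sketch only. It is written for Python 3 (where every str is already Unicode, unlike the Python 2 code above), and the bodies of transform_char and condition_met are my own assumptions about what you might want: strip accents via the standard unicodedata module, and keep letters, digits and whitespace.

```python
import unicodedata

def transform_char(c):
    # Hypothetical transform: NFKD splits "é" into "e" plus a combining
    # accent; keeping only the first code point drops the accent.
    return unicodedata.normalize("NFKD", c)[0]

def condition_met(c):
    # Hypothetical filter: keep Unicode letters, digits and whitespace.
    return c.isalnum() or c.isspace()

your_unicode_string = "She's my friend Adélaïde"
result = "".join(transform_char(c) for c in your_unicode_string if condition_met(c))
print(result)  # Shes my friend Adelaide
```

Note that taking only the first code point is crude: a ligature such as "ﬁ" would lose its second letter, so unidecode may still be the better transform if you have it installed.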

Look at the character class in your regular expression:

[^A-Za-z0-9]+

Here you're matching any run of characters that are not A-Z, a-z or 0-9.

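You replace these characters with the empty string; that is, you remove them. A space is not in the list, so it gets removed as well.

Because the class is negated, everything listed inside it is kept. If you want to keep other characters, simply add them to this list!
\s means whitespace, so: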

[^A-Za-z0-9\s]+
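
Dropped into the function from the question, that could look like the sketch below. Two assumptions on my part: it targets Python 3 rather than the Python 2 of the question, and it swaps unidecode for the standard library's unicodedata so it runs without extra packages. unidecode transliterates far more characters, so keep it if you rely on that.

```python
import re
import unicodedata

def cleanstr(text):
    # Decompose accented letters, e.g. "é" -> "e" + combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, leaving the plain base letters.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Remove everything except ASCII letters, digits and whitespace.
    return re.sub(r"[^A-Za-z0-9\s]+", "", stripped)

print(cleanstr("She's my friend Adélaïde"))  # Shes my friend Adelaide
```

Keep in mind that \s also matches tabs and newlines; use a literal space in the class instead if you only want to keep spaces.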
