I need to remove special characters from a string, but I also need to keep the whitespace. This is my code so far:
from unidecode import unidecode
import re

def cleanstr(string):
    if isinstance(string, str):
        string = string.decode('utf-8')
    string = unidecode(string)
    string = re.sub('[^A-Za-z0-9]+', '', string)
    return string

print cleanstr("She's my friend Adélaïde")
>> ShesmyfriendAdelaide
The expected result is: Shes my friend Adelaide
Without regular expressions
import string
sentence = "vg583$%#jgv f_vrefg fh4ufrh4 %# dhejrfh #"
print "".join([s for s in sentence if s in string.ascii_letters + string.digits + ' '])
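Resulting string (the spaces around the removed characters are kept, so there is a double space and a trailing space):
'vg583jgv fvrefg fh4ufrh4  dhejrfh '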
I admit this cannot handle Unicode as written, so you may need to tweak it a bit.
I think your final solution (in case you do want to deal with unicode) should look like this:
u''.join([transform_char(c) for c in your_unicode_string if condition_met(c)])
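As a rough illustration of that one-liner, here is a minimal sketch. It assumes Python 3 (where every str is already Unicode and print is a function), and the bodies of transform_char and condition_met are only one possible choice, not something the answer specifies: the condition keeps letters, digits and whitespace, and the transform uses the standard library's unicodedata to reduce accented letters to their ASCII base.

```python
import unicodedata

def condition_met(c):
    # Assumed rule: keep letters, digits and whitespace (Unicode-aware).
    return c.isalnum() or c.isspace()

def transform_char(c):
    # Assumed rule: decompose the character (NFKD) and keep only the ASCII
    # part, so 'é' becomes 'e'; characters with no ASCII part become ''.
    return unicodedata.normalize('NFKD', c).encode('ascii', 'ignore').decode('ascii')

your_unicode_string = "She's my friend Adélaïde"
cleaned = ''.join(transform_char(c) for c in your_unicode_string if condition_met(c))
print(cleaned)  # Shes my friend Adelaide
```

unidecode could be used inside transform_char instead; unicodedata is chosen here only so the sketch runs without extra packages.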
[^A-Za-z0-9]+
Here you're matching every character that is not A-Z, a-z or 0-9.
You replace these characters with the empty string; that is, you remove them.
Because the character class is negated, any character you want to keep must be added to this list.
\s means whitespace, so:
[^A-Za-z0-9\s]+
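To make the difference concrete, here is a small side-by-side sketch of the two patterns. It assumes Python 3 and an input that is already plain ASCII (for example, after unidecode has run); the variable names are illustrative only.

```python
import re

text = "She's my friend Adelaide"

# Original pattern: whitespace is not in the allowed set, so it is removed too.
without_spaces = re.sub(r'[^A-Za-z0-9]+', '', text)

# Adding \s to the negated class keeps spaces, tabs and newlines.
with_spaces = re.sub(r'[^A-Za-z0-9\s]+', '', text)

print(without_spaces)  # ShesmyfriendAdelaide
print(with_spaces)     # Shes my friend Adelaide
```

If only the plain space character should survive (and not tabs or newlines), a literal space can be put in the class instead of \s.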