Removing special characters (¡) from a string

Question

I am trying to write the contents of a collection to a file. The collection contains special characters such as ¡, which cause a problem. For example, the content in the collection has details like:

{..., Name: ¡Hi!, ...}

Now I am trying to write the same into a file, but I get the error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa1' in position 0: ordinal not in range(128)

I have tried the solutions provided here, but in vain. It would be great if someone could help me with this :)

So the example goes like this:

I have a collection which has the following details

{ "_id":ObjectId("5428ead854fed46f5ec4a0c9"), 
   "author":null,
   "class":"culture",
   "created":1411967707.356593,
   "description":null,
   "id":"eba9b4e2-900f-4707-b57d-aa659cbd0ac9",
   "name":"¡Hola!",
   "reviews":[

   ],
   "screenshot_urls":[

   ]
}

Now I try to access the name entry by iterating over the collection, i.e.:

f = open("sample.txt","w");

for val in exampleCollection:
   f.write("%s"%str(exampleCollection[val]).encode("utf-8"))

f.close();

Answer 1

The easiest way to remove characters you don't want is to specify the characters you do.

>>> import string
>>> validchars = string.ascii_letters + string.digits + ' '
>>> s = '¡Hi there!'
>>> clean = ''.join(c for c in s if c in validchars)
>>> clean
'Hi there'

If some forms of punctuation are okay, add them to validchars.
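
For instance, to keep the exclamation mark and a few other marks, the whitelist can be extended as in the sketch below. The helper name keep_only and the particular punctuation chosen are assumptions made for illustration only.

```python
# -*- coding: utf-8 -*-
import string

# Whitelist: ASCII letters, digits, space, plus the punctuation we choose to keep.
VALID_CHARS = set(string.ascii_letters + string.digits + u" " + u"!?.,")


def keep_only(text, allowed=VALID_CHARS):
    """Return text with every character that is not in `allowed` dropped."""
    return u"".join(c for c in text if c in allowed)


print(keep_only(u"¡Hi there!"))  # Hi there!
```

Using a set keeps each membership check cheap when the input strings are long.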

Answer 2

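This removes every character in the string that is not valid ASCII. The example is Python 3; in Python 2, start from a unicode literal such as u'¡Hola!', otherwise the byte string is implicitly decoded as ASCII first and raises UnicodeDecodeError.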

>>> '¡Hola!'.encode('ascii', 'ignore').decode('ascii')
'Hola!'

Alternatively, you can write the file as UTF-8, which can encode every Unicode character, so nothing has to be removed.
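
A minimal sketch of the UTF-8 route, assuming the value is already a unicode string. The output path is a throwaway temporary file chosen for illustration; io.open is used because it accepts an encoding argument in both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

name = u"¡Hola!"

# Hypothetical output location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# io.open expects unicode text and encodes it to UTF-8 bytes on write,
# so the ¡ is preserved instead of triggering the ASCII codec.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(name + u"\n")

# Reading back with the same encoding returns the original text.
with io.open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

print(round_trip == name + u"\n")  # True
```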

Answer 3

As one user posted on this page, you should take a look at the Unicode tutorial in the docs: https://docs.python.org/2/howto/unicode.html

What's happening is that you're trying to encode a character that's outside the ASCII range, which covers a mere 128 symbols (code points 0 to 127). There's a really great article on this I found a while back, which I'll try to find and post here.

Edit: ah, here it is: http://www.joelonsoftware.com/articles/Unicode.html
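
To see why ¡ falls outside that range, here is a small sketch that checks code points directly. The helper name is_ascii is made up for this example.

```python
# -*- coding: utf-8 -*-
ch = u"¡"
print(ord(ch))  # 161, i.e. 0xa1, which is above the ASCII maximum of 127


def is_ascii(text):
    """Return True if every code point in text is below 128."""
    return all(ord(c) < 128 for c in text)


print(is_ascii(u"Hola!"))   # True
print(is_ascii(u"¡Hola!"))  # False

# The strict ASCII codec reports the same condition as an exception.
try:
    ch.encode("ascii")
except UnicodeEncodeError as exc:
    print(exc.reason)  # ordinal not in range(128)
```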

Answer 4

You're trying to convert unicode to ascii in "strict" mode:

>>> help(str.encode)
Help on method_descriptor:

encode(...)
    S.encode([encoding[,errors]]) -> object

    Encodes S using the codec registered for encoding. encoding defaults
    to the default encoding. errors may be given to set a different error
    handling scheme. Default is 'strict' meaning that encoding errors raise
    a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
    'xmlcharrefreplace' as well as any other name registered with
    codecs.register_error that is able to handle UnicodeEncodeErrors.

You probably want something like one of the following:

s = u'¡Hi there!'

print s.encode('ascii', 'ignore')             # removes the ¡
print s.encode('ascii', 'replace')            # replaces it with ?
print s.encode('ascii', 'xmlcharrefreplace')  # turns it into an XML character reference
print s.encode('ascii', 'strict')             # raises UnicodeEncodeError


4 answers

solution1
2 ACCEPTED 2015-10-23 18:52:51

solution2
1 2015-10-23 19:07:27

solution3
0 2015-10-23 18:45:49

solution4
0 2015-10-23 19:18:06
