
Python: Unicode dictionary

I am trying to read in a tab separated utf-8 file to a Python dictionary and then check if the user's input is present as a dictionary key.

This is what I tried so far.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import io

def create_idiom_dic(idiom_file):
    newDict = {}
    with io.open(idiom_file, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                splitLine = line.split("\t")
                newDict[splitLine[0]] = ",".join(splitLine[1:])
    return newDict

def check_idioms(text, lang):
    si_idioms = create_idiom_dic("si.txt")
    ta_idioms = create_idiom_dic("ta.txt")
    if lang == "si":
        if text in si_idioms:
            print si_idioms[text]
    else:
        if text in ta_idioms:
            print ta_idioms[text]

The issue is that I cannot get the matching value when the correct key is given. Suppose I give the input text as

print ta_idioms[text] 

where text is "அவர் சிவலோக பதவி." This raises a KeyError:

KeyError: '\xe0\xae\x85\xe0\xae\xb5\xe0\xae\xb0\xe0\xaf\x8d \xe0\xae\x9a\xe0\xae\xbf\xe0\xae\xb5\xe0\xae\xb2\xe0\xaf\x8b\xe0\xae\x95 \xe0\xae\xaa\xe0\xae\xa4\xe0\xae\xb5\xe0\xae\xbf.' 

However, if I print with the key typed as a literal:

print ta_idioms["அவர் சிவலோக பதவி."]

This gives the correct result.
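The repr in that KeyError is itself a clue: the failing key is a UTF-8 byte string ('\xe0\xae\x85' is the encoding of அ), while the keys stored by create_idiom_dic are unicode objects, and the two never compare equal. A minimal sketch of the mismatch (written so it also runs under Python 3, where str is unicode):

```python
# -*- coding: utf-8 -*-
# A unicode key and its UTF-8 byte encoding are different objects;
# looking one up with the other always misses.
idioms = {u"அவர் சிவலோக பதவி.": u"some value"}

byte_key = u"அவர் சிவலோக பதவி.".encode("utf-8")

print(byte_key in idioms)                  # the byte string never matches
print(byte_key.decode("utf-8") in idioms)  # decoding it first does
```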

--UPDATE--

After some more effort, I discovered that the error is due to the encoding of the input text I am passing in. The input is initially read from a text file, which I open before calling check_idioms as shown below. I attempted to decode the text, but that raises an encoding error.

if __name__ == '__main__':
  with io.open(input_file, 'r', encoding='utf-8') as f:
    for line in f:
        text = line.strip().decode('utf-8')
        print text  
        #check_idioms(text, lang)

This in turn produces the following traceback:

text = line.strip().decode('utf-8')
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

What should be done to resolve this issue?

--UPDATE 2--

The hex dump of the file is:

0000000   �   � 221   �   � 232       �   � 234   �   � 204   �   � 232
0000010       �   � 234   �   � 231   �   �   �   �   � 222            
000001e

There are multiple ways to represent the same text in Unicode. In order to use Unicode text as a key, you need to normalize it. The Unicode standard defines four different normalization forms, but for this particular case, it doesn't matter much which one you choose, as long as you use it consistently.
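A minimal demonstration of why normalization matters, using a Latin example for brevity (the same applies to Tamil combining marks): the precomposed and decomposed spellings of the same text compare unequal until both are normalized.

```python
# -*- coding: utf-8 -*-
from unicodedata import normalize

# Two spellings of the same word: precomposed U+00E9 versus
# 'e' followed by the combining acute accent U+0301.
composed = u"caf\u00e9"
decomposed = u"cafe\u0301"

print(composed == decomposed)                                        # different code points
print(normalize('NFKC', composed) == normalize('NFKC', decomposed))  # equal once normalized
```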

from unicodedata import normalize

# ...
    newDict[normalize('NFKC', splitLine[0])] = ",".join(splitLine[1:])

# ...
print ta_idioms[normalize('NFKC',u"அவர் சிவலோக பதவி.")]

See the unicodedata module documentation for a (brief) reference, and http://en.wikipedia.org/wiki/Unicode_equivalence for background.

If the variable text is not a proper Unicode string, you need to convert it before you use it as a key. I'm guessing in this case you are looking for

print ta_idioms[normalize('NFKC', text.decode('utf-8'))]

... or better yet, whatever initializes text should properly decode it in the first place.

The first bytes of the string in the error message represent the UTF-8 encoding of U+0B85 (அ), which suggests that you are not actually decoding the input when reading the file, despite what your example shows; or perhaps the input file itself is erroneous and contains double-encoded text.
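On the decode error in the update: io.open(..., encoding='utf-8') already returns decoded unicode strings, so the extra .decode('utf-8') is what fails. Under Python 2, calling .decode on a unicode object first encodes it with the ascii codec, hence the confusing UnicodeEncodeError from a decode call. A sketch of the loop without the redundant decode, using a temporary file as a stand-in for input_file:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Write a small UTF-8 file to read back (a stand-in for input_file).
fd, path = tempfile.mkstemp()
os.close(fd)
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u"அவர் சிவலோக பதவி.\tsome meaning\n")

# io.open with an explicit encoding yields unicode lines directly;
# no .decode('utf-8') call is needed (or valid) afterwards.
with io.open(path, 'r', encoding='utf-8') as f:
    lines = [line.strip() for line in f]
os.remove(path)

print(lines[0].split(u"\t")[0])
```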

You might want to consider encapsulating the Unicode normalization by creating a dedicated class for ta_idioms that provides accessor methods which take care of this detail, so that the code using it doesn't have to.
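One way to sketch that encapsulation (the class name and the choice of NFKC are assumptions, not part of the original code):

```python
# -*- coding: utf-8 -*-
from unicodedata import normalize

class NormalizedDict(dict):
    """A dict that NFKC-normalizes keys on insertion and lookup,
    so callers never need to think about Unicode equivalence."""

    def __setitem__(self, key, value):
        dict.__setitem__(self, normalize('NFKC', key), value)

    def __getitem__(self, key):
        return dict.__getitem__(self, normalize('NFKC', key))

    def __contains__(self, key):
        return dict.__contains__(self, normalize('NFKC', key))

idioms = NormalizedDict()
idioms[u"caf\u00e9"] = u"value"   # stored under the precomposed spelling
print(idioms[u"cafe\u0301"])      # the decomposed spelling still matches
```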
