
Iterate through unicode strings and compare with unicode in a Python dictionary


I have two Python dictionaries containing information about Japanese words and characters:

  1. vocabDic: contains vocabulary; key: word, value: dictionary with information about it
  2. kanjiDic: contains kanji (single Japanese characters); key: kanji, value: dictionary with information about it

Now I would like to iterate through each character of each word in vocabDic and look up this character in the kanji dictionary. My goal is to create a CSV file which I can then import into a database as a join table for vocabulary and kanji.
My Python version is 2.6.
My code is as follows:

    import csv

    kanjiVocabJoinWriter = csv.writer(open('kanjiVocabJoin.csv', 'wb'), delimiter=',',
                                      quotechar='|', quoting=csv.QUOTE_MINIMAL)
    kanjiVocabJoinCount = 1
    # loop through dictionary
    for key, val in vocabDic.iteritems():
        if val['lang'] == 'jpn':  # only check japanese words
            vocab = val['text']
            print vocab
            # loop through vocab string
            for v in vocab:
                test = kanjiDic.get(v)
                print v
                print test
                if test is not None:
                    print str(kanjiVocabJoinCount)+','+str(test['id'])+','+str(val['id'])
                    kanjiVocabJoinWriter.writerow([str(kanjiVocabJoinCount), str(test['id']), str(val['id'])])
                    kanjiVocabJoinCount = kanjiVocabJoinCount+1

If I print the variables to the command line, I get:
vocab: works, prints in Japanese
v (one character of the vocab in the for loop):
test (character looked up in kanjiDic): None

To me it seems like the for loop messes the encoding up.
I tried various functions ( decode, encode.. ) but no luck so far.
Any ideas on how I could get this working?
Help would be very much appreciated.

From your description of the problem, it sounds like vocab is an encoded str object, not a unicode object.

For concreteness, suppose vocab equals u'債務の天井' encoded in utf-8:

In [42]: v=u'債務の天井'
In [43]: vocab=v.encode('utf-8')   # val['text']
In [44]: vocab
Out[44]: '\xe5\x82\xb5\xe5\x8b\x99\xe3\x81\xae\xe5\xa4\xa9\xe4\xba\x95'

If you loop over the encoded str object, you get one byte at a time: \xe5, then \x82, then \xb5, etc.

However, if you loop over the unicode object, you get one unicode character at a time:

In [45]: for v in u'債務の天井':
   ....:     print(v)    
債
務
の
天
井

Note that the first unicode character, encoded in utf-8, is 3 bytes:

In [49]: u'債'.encode('utf-8')
Out[49]: '\xe5\x82\xb5'

That's why looping over the bytes and printing one byte at a time (e.g. print '\xe5') fails to print a recognizable character. It is also why the lookup returns None: a single byte never matches a whole kanji key.

So it looks like you need to decode your str objects and work with unicode objects. You didn't mention what encoding you are using for your str objects. If it is utf-8, then you'd decode it like this:

vocab=val['text'].decode('utf-8')
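
Putting that together, here is a minimal sketch of the whole lookup with decoding applied to both the vocabulary text and the kanji keys. It assumes everything is utf-8; the helper names (to_unicode, build_join_rows) and the sample data are made up for illustration and are not from the original code. It returns the rows instead of writing the CSV, so the logic is easy to check first.

```python
# -*- coding: utf-8 -*-

def to_unicode(text, encoding='utf-8'):
    """Decode byte strings; pass already-decoded text through unchanged."""
    if isinstance(text, bytes):  # bytes is an alias of str on Python 2.6+
        return text.decode(encoding)
    return text


def build_join_rows(vocab_dic, kanji_dic, encoding='utf-8'):
    """Return (row_number, kanji_id, vocab_id) tuples for the join table."""
    # Normalise the kanji keys too, so both sides of the lookup are unicode.
    kanji_by_char = dict((to_unicode(k, encoding), info)
                         for k, info in kanji_dic.items())
    rows = []
    for val in vocab_dic.values():
        if val['lang'] != 'jpn':  # only check Japanese words
            continue
        # Iterating the decoded text yields characters, not bytes.
        for char in to_unicode(val['text'], encoding):
            kanji = kanji_by_char.get(char)
            if kanji is not None:
                rows.append((len(rows) + 1, kanji['id'], val['id']))
    return rows


# Hypothetical sample data shaped like vocabDic and kanjiDic (assumed utf-8).
vocab_dic = {
    'ceiling': {'id': 1, 'lang': 'jpn', 'text': u'債務の天井'.encode('utf-8')},
    'debt': {'id': 2, 'lang': 'eng', 'text': b'debt'},
}
kanji_dic = {
    u'債'.encode('utf-8'): {'id': 10},
    u'天'.encode('utf-8'): {'id': 11},
}
rows = build_join_rows(vocab_dic, kanji_dic)
print(rows)
```

Each tuple can then be passed to the csv writer's writerow method. If kanjiDic already has unicode keys, the normalisation step simply leaves them as they are.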

If you are not sure what encoding val['text'] is in, post the output of

print(repr(vocab))

and maybe we can guess the encoding.
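
In the meantime, one rough way to narrow it down yourself is to try a few likely encodings and see which ones decode without an error. This is only a sketch: the candidate list is an assumption (three encodings commonly used for Japanese text), candidate_encodings is a made-up helper name, and a clean decode does not prove the encoding is the right one, so check that the decoded text looks sensible.

```python
# -*- coding: utf-8 -*-

# Assumed candidate list; extend it with whatever your data source might use.
CANDIDATES = ('utf-8', 'shift_jis', 'euc_jp')


def candidate_encodings(raw, candidates=CANDIDATES):
    """Return the candidate encodings that decode `raw` without an error."""
    matches = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches


sample = u'債務の天井'.encode('shift_jis')  # stand-in for val['text']
print(candidate_encodings(sample))
```

If more than one candidate survives, print the decoded text for each and keep the one that reads as real Japanese.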
