Iterate through unicode strings and compare with unicode in a Python dictionary


I have two Python dictionaries containing information about Japanese words and characters:

  1. vocabDic: contains vocabulary. Key: the word; value: a dictionary with information about it.
  2. kanjiDic: contains kanji (single Japanese characters). Key: the kanji; value: a dictionary with information about it.

Now I would like to iterate through each character of each word in vocabDic and look up that character in the kanji dictionary. My goal is to create a CSV file that I can then import into a database as a join table for vocabulary and kanji.
My Python version is 2.6.
My code is as follows:

    import csv

    kanjiVocabJoinWriter = csv.writer(open('kanjiVocabJoin.csv', 'wb'), delimiter=',',
                                      quotechar='|', quoting=csv.QUOTE_MINIMAL)
    kanjiVocabJoinCount = 1
    # loop through dictionary
    for key, val in vocabDic.iteritems():
        # only check japanese words ('==' compares values; 'is' compares object identity)
        if val['lang'] == 'jpn':
            vocab = val['text']
            print vocab
            # loop through vocab string
            for v in vocab:
                test = kanjiDic.get(v)
                print v
                print test
                if test is not None:
                    print str(kanjiVocabJoinCount)+','+str(test['id'])+','+str(val['id'])
                    # a csv writer object is not callable; rows are written with writerow()
                    kanjiVocabJoinWriter.writerow([str(kanjiVocabJoinCount), str(test['id']), str(val['id'])])
                    kanjiVocabJoinCount = kanjiVocabJoinCount + 1

If I print the variables to the command line, I get:
vocab: works, prints in Japanese
v (one character of the vocab in the for loop):
test (character looked up in the kanjiDic): None

To me it seems like the for loop messes up the encoding.
I tried various functions (decode, encode, ...) but no luck so far.
Any ideas on how I could get this working?
Help would be very much appreciated.

From your description of the problem, it sounds like vocab is an encoded str object, not a unicode object.

For concreteness, suppose vocab equals u'債務の天井' encoded in utf-8:

In [42]: v=u'債務の天井'
In [43]: vocab=v.encode('utf-8')   # val['text']
In [44]: vocab
Out[44]: '\xe5\x82\xb5\xe5\x8b\x99\xe3\x81\xae\xe5\xa4\xa9\xe4\xba\x95'

If you loop over the encoded str object, you get one byte at a time: \xe5, then \x82, then \xb5, etc.

However, if you loop over the unicode object, you get one unicode character at a time:

In [45]: for v in u'債務の天井':
   ....:     print(v)    
債
務
の
天
井

Note that the first unicode character, encoded in utf-8, is 3 bytes:

In [49]: u'債'.encode('utf-8')
Out[49]: '\xe5\x82\xb5'

That's why looping over the bytes and printing one byte at a time (e.g. print '\xe5') fails to print a recognizable character.
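To make the byte-versus-character difference concrete, here is a small sketch that counts both for the same word. It uses bytearray so that the byte values come out as integers on both Python 2 and Python 3; the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
text = u'債務の天井'               # unicode object: 5 characters
encoded = text.encode('utf-8')     # encoded byte string, as stored in val['text']

chars = list(text)                        # one entry per character
byte_values = list(bytearray(encoded))    # one integer per byte

print(len(chars))         # number of characters
print(len(byte_values))   # number of bytes (each of these characters takes 3)

# The first three bytes together make up the single character u'債'.
first_char_bytes = list(bytearray(chars[0].encode('utf-8')))
print([hex(b) for b in first_char_bytes])
```

A dictionary keyed by whole characters can never match a single one of those bytes, which is consistent with the lookup returning None.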

So it looks like you need to decode your str objects and work with unicode objects. You didn't mention what encoding you are using for your str objects. If it is utf-8, then you'd decode it like this:

vocab=val['text'].decode('utf-8')
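Putting the decode step into the lookup loop might look like the sketch below. It rests on two assumptions that the question does not confirm: that the text is utf-8 encoded, and that the keys of kanjiDic are unicode objects (if they are encoded strs as well, they would need decoding too). The function join_rows and the sample dictionaries are hypothetical, made up purely for illustration.

```python
# -*- coding: utf-8 -*-
def join_rows(vocab_dic, kanji_dic, encoding='utf-8'):
    """Return [count, kanji_id, vocab_id] rows for every kanji found in a word."""
    rows = []
    count = 1
    for key, val in vocab_dic.items():
        if val['lang'] != 'jpn':          # only check Japanese words
            continue
        text = val['text']
        if isinstance(text, bytes):       # encoded str: decode before iterating
            text = text.decode(encoding)
        for ch in text:                   # now one character per iteration
            entry = kanji_dic.get(ch)
            if entry is not None:
                rows.append([count, entry['id'], val['id']])
                count += 1
    return rows

# Hypothetical sample data; kanji_dic is assumed to have unicode keys.
kanji_dic = {u'債': {'id': 1}, u'天': {'id': 2}}
vocab_dic = {
    'w1': {'id': 10, 'lang': 'jpn', 'text': u'債務の天井'.encode('utf-8')},
    'w2': {'id': 11, 'lang': 'eng', 'text': b'ceiling'},
}
rows = join_rows(vocab_dic, kanji_dic)
print(rows)
```

Each row could then be passed to the csv writer's writerow() method to build the join table.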

If you are not sure what encoding val['text'] is in, post the output of

print(repr(vocab))

and maybe we can guess the encoding.
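One rough way to narrow down the candidates yourself is to try a few encodings commonly used for Japanese text and see which ones decode without error. This is only a heuristic sketch: the helper candidate_decodings and its list of encodings are assumptions, and a successful decode does not prove the encoding is the right one, so the decoded text still needs to be checked by eye.

```python
# -*- coding: utf-8 -*-
def candidate_decodings(raw, encodings=('utf-8', 'shift_jis', 'euc_jp')):
    """Map each encoding that decodes raw without error to the decoded text."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            pass                      # raw is not valid in this encoding
    return results

raw = u'債務の天井'.encode('utf-8')   # stand-in for val['text']
found = candidate_decodings(raw)
for name in sorted(found):
    print(name + ': ' + repr(found[name]))
```

If more than one encoding succeeds, the one whose result reads as sensible Japanese is the likely match.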
