
Python: convert Chinese characters into pinyin with CJKLIB

I'm trying to convert a bunch of Chinese characters into pinyin, reading the characters from one file and writing the pinyin into another. I'm using the CJKLIB functions to do this.

Here's the code:

from cjklib.characterlookup import CharacterLookup

source_file = 'cities_test.txt'
dest_file = 'output.txt'

s = open(source_file, 'r')
d = open(dest_file, 'w')

cjk = CharacterLookup('T')

for line in s:
    p = line.split('\t')
    for p_shard in p:
        for c in p_shard:
            readings = cjk.getReadingForCharacter(c.encode('utf-8'), 'Pinyin')
            d.write(readings[0].encode('utf-8'))
        d.write('\t')
    d.write('\n')

s.close()
d.close()

My problem is that I keep running into Unicode-related errors. They come up when I call the getReadingForCharacter function. If I call it as written,

readings = cjk.getReadingForCharacter(c.encode('utf-8'), 'Pinyin')

I get: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128).

If I call it like this, without the .encode(),

readings = cjk.getReadingForCharacter(c, 'Pinyin')

I get an error thrown by sqlalchemy (CJKLIB uses sqlalchemy and sqlite): You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings ... etc.

Can someone help me out? Thanks!

Oh, also: is there a way for CJKLIB to return the pinyin without any tones? I think by default it returns pinyin with these weird accented characters to represent the tones. I just want the plain letters, without the tone marks.

Your bug is that you are not decoding the input stream, and yet you are turning around and re-encoding it as though it were UTF-8. That's going the wrong way. In Python 2, calling .encode() on a byte string first decodes it implicitly with the ASCII codec, which is where the UnicodeDecodeError comes from.

You have two choices.

You can codecs.open the input file with an explicit encoding, so that you always get back regular Unicode strings whenever you read from it, because the decoding is automatic. This is always my strong preference. There is no such thing as a text file anymore.
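A minimal sketch of the first option, assuming the input is UTF-8 and using a throwaway sample file (the path and its contents are made up for illustration). If the real file begins with a byte order mark, which the 0xef byte in the error message may hint at, 'utf-8-sig' could be the encoding to try instead; that is a guess, not something the question confirms.

```python
import codecs
import os
import tempfile

# Hypothetical stand-in for cities_test.txt: one line, two tab-separated fields.
sample_path = os.path.join(tempfile.mkdtemp(), 'cities_sample.txt')
with codecs.open(sample_path, 'w', encoding='utf-8') as f:
    f.write(u'\u5317\u4eac\t\u4e0a\u6d77\n')  # "Beijing<TAB>Shanghai" in hanzi

# With an explicit encoding, every read returns decoded text, not raw bytes.
with codecs.open(sample_path, 'r', encoding='utf-8') as s:
    lines = [line.rstrip(u'\r\n') for line in s]

fields = lines[0].split(u'\t')
# Iterating decoded text yields one whole character per step.
chars = [c for field in fields for c in field]
```

Each element of chars is then a single decoded character that can be handed to the lookup function as-is, with no .encode() call.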

Your other choice is to manually decode your binary string before you pass it to the function. I hate this style, because it almost always indicates that you're doing something wrong, and even when it doesn't, it is clumsy as all get out.
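For comparison, the manual route might look roughly like this. The sample bytes are invented, and UTF-8 is assumed; the point is only that the decode happens once, on the whole line, before any per-character loop.

```python
# Bytes as they would arrive from a file opened without an encoding.
raw_line = u'\u5317\u4eac\t\u4e0a\u6d77\n'.encode('utf-8')

# Decode once, at the boundary, before splitting or iterating.
text_line = raw_line.decode('utf-8')

# The decoded line iterates character by character, not byte by byte.
chars = [c for c in text_line.strip() if c != u'\t']
```

The raw line is 14 bytes long but holds only four hanzi, which is why iterating over the undecoded bytes cannot work.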

I would do the same thing for the output file. I just hate seeing manual .encode("utf-8") and .decode("utf-8") calls all over the place. Set the stream encoding and be done with it.
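Putting both ends together, a sketch of the whole loop with the encoding set on both streams could look like the following. The lookup_reading function and its tiny table are hypothetical stand-ins so the example runs without cjklib installed; in the real script that call would be the CJKLIB lookup from the question. File names are also made up.

```python
import codecs
import os
import tempfile

# Hypothetical stand-in for the CJKLIB lookup; NOT part of cjklib.
SAMPLE_READINGS = {u'\u5317': u'b\u011bi', u'\u4eac': u'j\u012bng'}


def lookup_reading(char):
    """Return a reading for char, or char itself if it is not in the table."""
    return SAMPLE_READINGS.get(char, char)


def convert(source_path, dest_path, encoding='utf-8'):
    """Read tab-separated hanzi and write tab-separated readings."""
    with codecs.open(source_path, 'r', encoding=encoding) as s, \
         codecs.open(dest_path, 'w', encoding=encoding) as d:
        for line in s:
            fields = line.rstrip(u'\r\n').split(u'\t')
            converted = [u''.join(lookup_reading(c) for c in field)
                         for field in fields]
            # Plain text goes in; the stream handles the encoding.
            d.write(u'\t'.join(converted) + u'\n')


workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'in.txt')
dst = os.path.join(workdir, 'out.txt')
with codecs.open(src, 'w', encoding='utf-8') as f:
    f.write(u'\u5317\u4eac\t\u4eac\n')

convert(src, dst)

with codecs.open(dst, 'r', encoding='utf-8') as f:
    result = f.read()
```

Unlike the code in the question, this joins the fields with tabs rather than writing a tab after every field, so lines do not end in a trailing tab.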
