简体   繁体   English

将字符转换为unicode

[英]convert character to unicode

This code is working as expected. 此代码按预期工作。 Only problem is that if there is unicode character, it gets converted to ASCII. 唯一的问题是,如果有unicode字符,它将转换为ASCII。

with open('test.idx', 'w') as writefile:
    with open('test.dat') as myfile:
        mystr=myfile.read()
        for myword in mystr.split('|'):
            tow=myword, '|', mystr.index(myword)
            print >>writefile, tow

In [74]: !cat test.dat
UTF-8
जनन|1
जन्म देणे
शिक्षण|1
 क्षेत्रातील संशोधनाच्या बाजारीकरणा बाबतीत व्यक्त केलेली 
पूर्व|1
 पगड्यामुळे

In [75]: !cat test.idx
('UTF-8\n\xe0\xa4\x9c\xe0\xa4\xa8\xe0\xa4\xa8', '|', 0)

I expect to see unicode instead of escaped code. 我希望看到unicode而不是转义代码。

You created a tuple: 你创建了一个元组:

tow=myword, '|', mystr.index(myword)

That's not a string object, that's a tuple containing three other objects, two of which are strings, one an integer. 这不是一个字符串对象,它是一个包含三个其他对象的元组,其中两个是字符串,一个是整数。

When you then write that tuple to the file, Python has to convert it to a string. 然后,当您将该元组写入文件时,Python必须将其转换为字符串。 Converting any Python container (be that a tuple, a list, a set or a dictionary) will use the repr() representation of the contained objects. 转换任何Python容器(无论是元组,列表,集合还是字典)都将使用所包含对象的repr()表示。 For strings, that means only printable ASCII characters are allowed and shown, everything else uses escape sequences, most often the \\xhh form. 对于字符串,这意味着只允许和显示可打印的ASCII字符,其他所有字符都使用转义序列,通常是\\xhh表单。

If that is not correct output for your usecase, you need to do the string conversion yourself. 如果您的用例输出不正确,则需要自己进行字符串转换。 You could use string formatting: 您可以使用字符串格式:

tow = '{}|{}'.format(myword, mystr.index(myword))

If you are producing a lot of | 如果你正在生产很多| -separated data, you may want to look at the csv module instead to handle the separator and file writing. - 分离数据,您可能需要查看csv模块来处理分隔符和文件写入。

You are seeing the repr representation as you are storing the data in a tuple. 当您将数据存储在元组中时,您将看到repr表示。 To match your expected output use str.join: 要匹配您的预期输出,请使用str.join:

     print >>writefile, "".join(map(str,tow))

The output file will contain: 输出文件将包含:

UTF-8
जनन|0
1
जन्म देणे
शिक्षण|16
1
 क्षेत्रातील संशोधनाच्या बाजारीकरणा बाबतीत व्यक्त केलेली 
पूर्व|63
1
 पगड्यामुळे|239

If you add a print(tow) in your code you will see you have tuples. 如果在代码中添加print(tow) ,您将看到有元组。

('UTF-8\n\xe0\xa4\x9c\xe0\xa4\xa8\xe0\xa4\xa8', '|', 0)
('1\n\xe0\xa4\x9c\xe0\xa4\xa8\xe0\xa5\x8d\xe0\xa4\xae \xe0\xa4\xa6\xe0\xa5\x87\xe0\xa4\xa3\xe0\xa5\x87\n\xe0\xa4\xb6\xe0\xa4\xbf\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xa3', '|', 16)
('1\n \xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa5\x87\xe0\xa4\xa4\xe0\xa5\x8d\xe0\xa4\xb0\xe0\xa4\xbe\xe0\xa4\xa4\xe0\xa5\x80\xe0\xa4\xb2 \xe0\xa4\xb8\xe0\xa4\x82\xe0\xa4\xb6\xe0\xa5\x8b\xe0\xa4\xa7\xe0\xa4\xa8\xe0\xa4\xbe\xe0\xa4\x9a\xe0\xa5\x8d\xe0\xa4\xaf\xe0\xa4\xbe \xe0\xa4\xac\xe0\xa4\xbe\xe0\xa4\x9c\xe0\xa4\xbe\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa4\xb0\xe0\xa4\xa3\xe0\xa4\xbe \xe0\xa4\xac\xe0\xa4\xbe\xe0\xa4\xac\xe0\xa4\xa4\xe0\xa5\x80\xe0\xa4\xa4 \xe0\xa4\xb5\xe0\xa5\x8d\xe0\xa4\xaf\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xa4 \xe0\xa4\x95\xe0\xa5\x87\xe0\xa4\xb2\xe0\xa5\x87\xe0\xa4\xb2\xe0\xa5\x80 \n\xe0\xa4\xaa\xe0\xa5\x82\xe0\xa4\xb0\xe0\xa5\x8d\xe0\xa4\xb5', '|', 63)
('1\n \xe0\xa4\xaa\xe0\xa4\x97\xe0\xa4\xa1\xe0\xa5\x8d\xe0\xa4\xaf\xe0\xa4\xbe\xe0\xa4\xae\xe0\xa5\x81\xe0\xa4\xb3\xe0\xa5\x87', '|', 239)

You also have utf-8 encoded strings not unicode, if you printed the individual elements from tow you would also see the correct output. 你也有utf-8编码的字符串而不是unicode,如果你从tow中打印出单个元素,你也会看到正确的输出。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM