Treat all Unicode characters as single letters

I want to create a program that computes the "value" of a word by summing values assigned to its letters, where each letter's value is the order in which it first appears in the word (as an exercise, I'm new to Python).
For example, "foo" would return 5 (as 'f' = 1, 'o' = 2, so 1 + 2 + 2) and "bar" would return 6 (as 'b' = 1, 'a' = 2, 'r' = 3).

Here's my code so far:

# -*- coding: utf-8 -*-
def ppn(word):
    word = list(word)
    cipher = dict()
    i = 1
    e = 0

    for letter in word:
        if letter not in cipher:
            cipher[letter] = i
            e += i
            i += 1
        else:
            e += cipher[letter]
    return ''.join(word) + ": " + str(e)


if __name__ == "__main__":
    print ppn(str(raw_input()))

It works well; however, for words containing characters like 'ł', 'ą' etc. it doesn't return the correct value (I would guess it's because it translates these letters to Unicode codes first). Is there a way to bypass this and make the interpreter treat all the letters as single letters?

Decode your input into unicode, then use unicode everywhere, then encode when you output.

Specifically, you will need to change

print ppn(str(raw_input()))

to

print ppn(raw_input().decode(sys.stdin.encoding))

This will decode your input (add import sys at the top of the script so that sys.stdin is available). Then you will also need to change

''.join(word) + ": " + str(e)

to

u''.join(word) + u': ' + unicode(e)

This makes all your code use unicode objects internally.

print will encode the unicode string to whatever encoding your terminal is using, but you can also specify the encoding explicitly if you need to.
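The decode, process, encode pattern described in this answer can be sketched as follows. This is only an illustration, written for Python 3 (where bytes.decode and str.encode play the roles of Python 2's str.decode and unicode.encode); the process helper and its encoding parameter are hypothetical names, not part of the original code, and ppn is a condensed version of the question's function.

```python
def ppn(word):
    """Score a word: each new character gets the next integer, repeats reuse it."""
    cipher = {}
    total = 0
    for letter in word:
        if letter not in cipher:
            cipher[letter] = len(cipher) + 1  # first appearance: next value
        total += cipher[letter]
    return word + ": " + str(total)


def process(raw_bytes, encoding="utf-8"):
    """Hypothetical helper: decode at the input edge, encode at the output edge."""
    text = raw_bytes.decode(encoding)   # bytes -> text, once, on the way in
    result = ppn(text)                  # all internal work happens on text
    return result.encode(encoding)      # text -> bytes, once, on the way out
```

Calling process with the UTF-8 bytes of "foo" gives back the bytes of "foo: 5"; a word such as "żółw" is scored as four letters rather than seven bytes.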

Alternatively, you can keep the logic exactly as it is and run it with Python 3 (after changing the print statement to the print() function and raw_input to input).

For more information, please read this very useful talk on the subject.

Decode with the encoding of your shell:

if __name__ == "__main__":
    import sys
    print ppn((raw_input()).decode(sys.stdin.encoding))

On Unix systems, UTF-8 typically works. On Windows, things can be different. To be safe, use sys.stdin.encoding. You never know where your script is going to run.
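One caveat worth sketching: sys.stdin.encoding is not guaranteed to be set (in Python 2 it can be None when input is piped in rather than typed at a terminal). A minimal sketch of a fallback follows; the input_encoding helper and its "utf-8" default are assumptions made for illustration, not something the answer prescribes.

```python
import sys


def input_encoding(stream=None, default="utf-8"):
    """Hypothetical helper: return the stream's encoding, or a default if unset."""
    stream = sys.stdin if stream is None else stream
    # Some streams lack the attribute, others have it set to None.
    return getattr(stream, "encoding", None) or default
```

The returned name can then be passed to decode in place of sys.stdin.encoding.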

Or, even better, switch to Python 3:

# -*- coding: utf-8 -*-

import sys

assert sys.version_info.major > 2


def ppn(word):
    word = list(word)
    cipher = dict()
    i = 1
    e = 0

    for letter in word:
        if letter not in cipher:
            cipher[letter] = i
            e += i
            i += 1
        else:
            e += cipher[letter]
    return ''.join(word) + ": " + str(e)

if __name__ == "__main__":
    print(ppn(str(input())))

In Python 3, strings are Unicode by default, so there is no need for the decoding business.

All the answers so far have explained what to do, but not what's going on, so here are some hints.

When you use raw_input() with Python 2, you get back a string of bytes (input() on Python 3 behaves differently). Most Unicode characters cannot be represented as a single byte, because there are more Unicode characters than the 256 values a byte can hold.

Characters like ł or ą, when encoded with UTF-8 or other encodings, can take two bytes or more:

>>> 'ł'
'\xc5\x82'
>>> 'ą'
'\xc4\x85'

Your original program interprets those two bytes as two distinct characters, which leads to incorrect results.
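To make the miscount concrete, here is a small sketch in Python 3 that scores the same word once as text and once as its UTF-8 bytes. The score function is a condensed stand-in for the question's ppn, and the word "łał" is just an illustrative choice.

```python
def score(symbols):
    """Sum first-appearance ranks over any sequence (text characters or bytes)."""
    cipher = {}
    total = 0
    for s in symbols:
        if s not in cipher:
            cipher[s] = len(cipher) + 1
        total += cipher[s]
    return total


word = "łał"
as_bytes = word.encode("utf-8")  # b'\xc5\x82a\xc5\x82': each ł becomes two bytes

print(len(word), score(word))          # 3 4  (ł=1, a=2, ł=1)
print(len(as_bytes), score(as_bytes))  # 5 9  (c5=1, 82=2, 61=3, c5=1, 82=2)
```

The byte version sees five symbols instead of three, which is exactly the kind of inflated result the Python 2 program produces.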

Python offers an alternative to byte strings: unicode strings. In a unicode string, each character counts as exactly one character (the internal representation of the string is opaque), so the problem you are experiencing cannot occur.

Therefore, decoding the byte string into a unicode string is the way to go.
