Treat all Unicode characters as single letters

Question

I want to write a program that computes the "value" of a word by summing the values of its letters, where each distinct letter gets a value based on the order in which it first appears in the word (as an exercise, I'm new to Python).
E.g. "foo" would return 5 (as 'f' = 1, 'o' = 2) and "bar" would return 6 (as 'b' = 1, 'a' = 2, 'r' = 3).

Here's my code so far:

# -*- coding: utf-8 -*-
def ppn(word):
    word = list(word)
    cipher = dict()
    i = 1
    e = 0

    for letter in word:
        if letter not in cipher:
            cipher[letter] = i
            e += i
            i += 1
        else:
            e += cipher[letter]
    return ''.join(word) + ": " + str(e)


if __name__ == "__main__":
    print ppn(str(raw_input()))

It works well; however, for words containing characters like 'ł', 'ą', etc. it doesn't return the correct value (my guess is that these letters get converted to Unicode codes first). Is there a way around this, so that the interpreter treats every letter as a single letter?

Answer 1

Decode your input into unicode, then use unicode everywhere, then encode when you output.

Specifically you will need to change

print ppn(str(raw_input()))

To

print ppn(raw_input().decode(sys.stdin.encoding))

This will decode your input (remember to import sys at the top of the script). Then you will also need to change

''.join(word) + ": " + str(e)

To

u''.join(word) + u': ' + unicode(e)

This makes all your code use unicode objects internally.

print will encode the unicode object to whatever encoding your terminal is using, but you can also specify the encoding explicitly if you need to.
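As a rough sketch of that decode / work on unicode / encode pattern: the snippet below assumes the input bytes are UTF-8 (in a real script you would take the encoding from sys.stdin.encoding), and the helper name word_value is made up for illustration. Byte literals stand in for raw_input() so that it behaves the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-


def word_value(word):
    """Sum letter values, numbering letters by order of first appearance."""
    cipher = {}
    total = 0
    for letter in word:
        if letter not in cipher:
            cipher[letter] = len(cipher) + 1
        total += cipher[letter]
    return total


# Bytes as they might arrive from a UTF-8 terminal: the word "łał".
raw = b'\xc5\x82a\xc5\x82'

# 1. Decode at the input boundary (UTF-8 is an assumption here).
text = raw.decode('utf-8')

# 2. Work on text only: three characters, two of them distinct.
value = word_value(text)

# 3. Encode again only at the output boundary.
encoded = u'{0}: {1}'.format(text, value).encode('utf-8')
```

Here value comes out as 4 ('ł' = 1, 'a' = 2, then 'ł' = 1 again), which is what the question expects for a three-letter word with one repeated letter.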

Alternatively, you can keep your logic exactly as it is and run it with Python 3 (print and raw_input() then need their Python 3 forms, print() and input()).

For more information, please read this very useful talk on the subject

Answer 2

Decode with the encoding of your shell:

if __name__ == "__main__":
    import sys
    print ppn((raw_input()).decode(sys.stdin.encoding))

On Unix systems UTF-8 typically works. On Windows things can be different. To be safe, use sys.stdin.encoding. You never know where your script is going to run.
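One caveat, as far as I know: on Python 2, sys.stdin.encoding can be None when the input is piped or redirected rather than typed at a terminal, and calling .decode(None) then fails. A hedged sketch of a fallback follows; the helper name decode_input and the UTF-8 default are my own assumptions, not part of the original code.

```python
import sys


def decode_input(raw, encoding=None):
    """Decode raw input bytes, falling back to UTF-8 (an assumed default)."""
    if encoding is None:
        # The attribute may be missing or None, e.g. when stdin is a pipe.
        encoding = getattr(sys.stdin, 'encoding', None) or 'utf-8'
    return raw.decode(encoding)
```

On Python 2 the last line of the script would then read print ppn(decode_input(raw_input())).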

Or, even better, switch to Python 3:

# -*- coding: utf-8 -*-

import sys

assert sys.version_info.major > 2


def ppn(word):
    word = list(word)
    cipher = dict()
    i = 1
    e = 0

    for letter in word:
        if letter not in cipher:
            cipher[letter] = i
            e += i
            i += 1
        else:
            e += cipher[letter]
    return ''.join(word) + ": " + str(e)

if __name__ == "__main__":
    print(ppn(str(input())))

In Python 3, strings are Unicode by default, so there is no need for the decoding business.

Answer 3

All the answers so far have explained what to do, but not what's going on, so here are some hints.

When you use raw_input() with Python 2, you get back a string of bytes (input() on Python 3 behaves differently: it returns a text string). Most Unicode characters cannot be represented as a single byte, because there are far more Unicode characters than the 256 values a byte can hold.

Characters like ł or ą, when encoded with UTF-8 or another multi-byte encoding, take two bytes or more (shown here on Python 2 with a UTF-8 terminal):

>>> 'ł'
'\xc5\x82'
>>> 'ą'
'\xc4\x85'

Your original program interprets each of those two bytes as a separate character, leading to incorrect results.
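To make that concrete, here is a small sketch comparing what a byte-wise loop sees with what a character-wise loop sees for the word "łał". It assumes UTF-8 input; bytearray is used so that iterating over the bytes gives integers on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
raw = b'\xc5\x82a\xc5\x82'    # "łał" encoded as UTF-8
text = raw.decode('utf-8')    # the same word as a unicode string

byte_count = len(raw)     # each 'ł' contributes two bytes
char_count = len(text)    # each 'ł' is one character

# Distinct "letters" seen by a byte-wise loop vs. a character-wise loop.
distinct_bytes = len(set(bytearray(raw)))  # 0xc5, 0x82 and 'a'
distinct_chars = len(set(text))            # 'ł' and 'a'
```

The byte string has 5 elements with 3 distinct values, so the original loop would score it as 1 + 2 + 3 + 1 + 2 = 9, whereas the decoded text has 3 characters with 2 distinct values and scores 1 + 2 + 1 = 4.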

Python offers an alternative to byte strings: unicode strings. In a unicode string, one character is exactly one element of the string (the internal representation is opaque), so the problem you are experiencing cannot occur.

Therefore decoding the bytestring into a unicode string is the way to go.

3 answers

solution1
2 ACCEPTED 2015-12-17 17:43:57

solution2
2 2015-12-17 17:45:05

solution3
2 2015-12-17 17:57:23