
Trouble with utf-8 encoding/decoding

I am reading a UTF-8 encoded .csv file. I want to create an index and rewrite the CSV. The index consists of the first letter of a word followed by a running number. Python 2.7.10, Ubuntu Server.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
counter = 0
tempDict = {}
with open(modifiedFile, "wb") as newFile:
    with open(originalFile, "r") as file:
        for row in file:
            myList = row.split(",")
            toId = str(myList[0])

            if toId not in tempDict:
                tempDict[toId] = counter
                myId = str(toId[0]) + str(counter)
                myList.append(myId)
                counter += 1
            else:
                myId = str(toId[0]) + str(tempDict[toId])
                myList.append(myId)

            # and then I write everything into the csv
            for i, j in enumerate(myList):
                if i < 6:
                    newFile.write(str(j).strip())
                    newFile.write(",")

                else: 
                    newFile.write(str(j).strip())
                    newFile.write("\n")

The problem is the following. When a word starts with a fancy letter, such as

  • Č
  • É
  • Ā
  • ...

The id I create starts with a ? instead of the first letter of the word. The strange part is that within the CSV I create, the words with the fancy letters are written correctly. There are no ? or other symbols which would indicate a wrong encoding.

Why is that?

First of all, you should not be learning Python 2 unless there is a specific legacy C extension that you need.

Python 3 makes major changes to the unicode/bytes handling that remove (most) implicit behavior and make errors visible. It's still good practice to use open('filename', encoding='utf-8') since the default encoding is environment- and platform-dependent.
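As a rough illustration of that point (Python 3 only), the sketch below prints nothing but records the locale-dependent default and then reads a file with an explicit encoding. The file name and its contents are made up for the example.

```python
import locale
import os
import tempfile

# Whatever this reports depends on the machine's locale settings.
default_encoding = locale.getpreferredencoding(False)

# Hypothetical sample file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "wb") as handle:
    handle.write(b"\xc4\x8cesko,1\n")  # "Česko,1" as UTF-8 bytes

# Naming the encoding makes the result independent of the environment.
with open(path, encoding="utf-8") as handle:
    first_field = handle.read().split(",")[0]
```

With the encoding spelled out, first_field is the text "Česko" regardless of what default_encoding happens to be.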

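Indeed, running your program in Python 3 should fix it with almost no changes (the output file has to be opened in text mode, "w" instead of "wb", because str can no longer be written to a binary file). But here's where your bug lies: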

        toId = str(myList[0])

This is a no-op, since myList[0] is already a str .

            myId = str(toId[0]) + str(counter)

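This is a bug: toId is a str (byte string) containing UTF-8 data, so toId[0] is the first byte, not the first character. Č, É and Ā each take two bytes in UTF-8, so the id starts with half a character, which is displayed as ? . The rest of the row looks fine because its bytes are copied through unchanged. You never want to index or slice UTF-8 bytes; decode them to unicode first and work with characters.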
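A minimal sketch of the difference, using a made-up sample word ("Česko") and slicing so that it behaves the same on Python 2.7 and 3:

```python
# -*- coding: utf-8 -*-
word = u"\u010cesko"          # "Česko", hypothetical sample value
raw = word.encode("utf-8")    # what a byte-mode read of the file gives you

first_byte = raw[:1]                       # only half of the two-byte "Č"
first_char = raw.decode("utf-8")[0]        # the whole character
shown_as = first_byte.decode("utf-8", "replace")  # what a viewer falls back to
```

Here first_byte is the lone lead byte 0xC4, which is not valid UTF-8 on its own, while first_char is the full "Č".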

with open(originalFile, "r") as file:

This is a style error, since it shadows the built-in file .

Two changes are needed to make this run correctly under Python 2.

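  1. Change open(filename, mode) to io.open(filename, mode, encoding='utf-8') , using text mode ("w" rather than "wb") for the output file, since binary mode does not accept an encoding.
  2. Stop calling str() on strings: once they are unicode objects, str() actually attempts to encode them (in ASCII!).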
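Put together, the two changes might look roughly like the sketch below. It is not the asker's exact program: the function name add_ids, the sample rows, and the choice to skip rows with an empty first column are assumptions made for the illustration.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def add_ids(original_path, modified_path):
    """Copy a UTF-8 CSV, appending an id (first letter + running number) to each row."""
    counter = 0
    seen = {}  # first-column value -> number assigned to it
    with io.open(original_path, "r", encoding="utf-8") as src, \
            io.open(modified_path, "w", encoding="utf-8") as dst:
        for row in src:
            fields = [field.strip() for field in row.split(",")]
            key = fields[0]
            if not key:
                continue  # assumption: rows with an empty first column are skipped
            if key not in seen:
                seen[key] = counter
                counter += 1
            # key is text here, so key[0] is a whole character, not a byte
            fields.append(u"{0}{1}".format(key[0], seen[key]))
            dst.write(u",".join(fields) + u"\n")


# Hypothetical sample data, for illustration only.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "original.csv")
target = os.path.join(workdir, "modified.csv")
with io.open(source, "w", encoding="utf-8") as handle:
    handle.write(u"\u010cesko,a\n\u00c9cole,b\n\u010cesko,c\n")

add_ids(source, target)

with io.open(target, "r", encoding="utf-8") as handle:
    result = handle.read().splitlines()
```

Because io.open hands back unicode text, no str() calls are needed anywhere, and repeated words reuse the id they were given the first time.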

But you really should switch to Python 3.

There are a few pieces new to 2.6 and 2.7 that are intended to bridge the gap to 3, and one of them is the io module, which behaves in all the nice new ways: Unicode files and universal newlines.

~$ python2.7 -c 'import io,sys;print(list(io.open(sys.argv[1],encoding="u8")))' <(printf $'\xc3\x84\r\n\xc3\xb9\r\n')
[u'\xc4\n', u'\xf9\n']
~$ python3 -c 'import sys;print(list(open(sys.argv[1],encoding="u8")))' <(printf $'\xc3\x84\r\n\xc3\xb9\r\n')
['Ä\n', 'ù\n']

This is useful for writing software that runs on both 2 and 3. Again, the encoding argument is optional, but the default encoding is environment-dependent on all platforms, so it's good to be specific.

In Python 2.x, strings are byte strings by default, and str() returns a byte string rather than a unicode one. Use unicode() instead.
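What str() does to a unicode object in Python 2 is an implicit encode with the default codec, which is normally ASCII. The sketch below spells that step out explicitly so it runs on both 2.7 and 3; the sample id value is made up.

```python
# -*- coding: utf-8 -*-
text = u"\u010c0"  # "Č0", a hypothetical id value

# The explicit equivalent of Python 2's str(text) under the ASCII default.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding deliberately, with a codec that can represent the character.
utf8_bytes = text.encode("utf-8")
```

The ASCII attempt fails because "Č" has no ASCII representation, whereas the explicit UTF-8 encode produces the expected bytes.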

Besides, you should open the file with an explicit utf-8 encoding, for example through codecs.open() , rather than the built-in open() .
