简体   繁体   中英

UnicodeEncodeError when writing to file

I have a python script that works great on my local machine (OS X), but when I copied it to a server (Debian), it does not work as expected. The script reads an xml file and prints the contents in a new format. On my local machine, I can run the script with stdout to the terminal or to a file (ie > myFile.txt ), and both work fine.

However, on the server ( ssh ), when I print to terminal everything works fine, but printing to the file (which is what I really need) gives UnicodeEncodeError: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128) . All files are in utf-8 encoding, and utf-8 is declared in the magic comment.

If I print the str objects inside a list (which is a trick I usually use to get a handle on encoding issues), it also throws the same error.

If I use print( x.encode('utf-8') ) , then it prints code-style bits (eg b'1' b'\\xd0\\x9a\\xd0\\xb0\\xd0\\xbc\\xd0\\xb0' ).

If I $ export PYTHONIOENCODING=utf-8 in the shell (as suggested in some SO posts), then I get a binary file: 1 <D0><9A><D0><B0><D0><BC><D0><B0> .

I have checked all of the locale variables and the relevant ones match what I have on my local machine.

I can simply process the file locally and upload it, but I really want to understand what is happening here. Since the python code is working on one computer, I am not sure that it is relevant, but I am adding it below:

# -*- encoding: utf-8 -*-
import sys, xml.etree.ElementTree as ET

corpus = ET.parse('file.xml')
text = corpus.getroot()
for body in text :
  for sent in body :
    depDOMs = [(0,'') for i in range(len(sent)+1)]
    for word in sent :
      if word.tag == 'LF' :
        pass
      elif 'ID' in word.attrib and 'FEAT' in word.attrib and 'DOM' in word.attrib :
        ID = word.attrib['ID']
        try :
          Form =  word.text.replace(' ','_')
        except AttributeError :
          Form = '_'
        try :
          Lemma =  word.attrib['LEMMA'].replace(' ', '_')
        except KeyError :
          Lemma = '*NULL*'
        CPOS = word.attrib['FEAT'].split()[0]
        POS = word.attrib['FEAT'].replace( ' ' , '_' )
        Feats = '_'
        Head = word.attrib['DOM']
        if Head == '_root' :
          Head = '0'
        try :
          DepRel = word.attrib['LINK']
        except KeyError :
          DepRel = 'ROOT'
        PHead = '_'
        PDepRel = '_'
        try:
          if word.attrib['NODETYPE'] == 'FANTOM' :
            word.attrib['LEMMA'] = '*'+word.attrib['LEMMA']+'*'
        except KeyError :
          pass
        print( ID , Form , Lemma , Feats, CPOS , POS , Head , DepRel , PHead , PDepRel , sep='\t' )
      else :
        print( 'WARNING: what is this?',sent.attrib['ID'],word.attrib)
  print()

The underlying issue may be caused by a miss configuration of Linux's locales, meaning that Python is being too cautious when printing non-ASCII chars.

Confirm locale configuration with locale . If there's a problem, you'll see something like:

$ locale 
locale: Cannot set LC_CTYPE to default locale: No such file or directory 
locale: Cannot set LC_ALL to default locale: No such file or directory 
LANG=en_US.UTF-8 
LANGUAGE= 

Fix this with:

$ sudo locale-gen "en_US.UTF-8"

(replace "en_US.UTF-8" with the locale that's not working). For further info, see: https://askubuntu.com/questions/162391/how-do-i-fix-my-locale-issue

You can find important information related to the error you are experiencing in the attributes of the UnicodeError based exception.

Quoting the documentation:

UnicodeError has attributes that describe the encoding or decoding error. For example, err.object[err.start:err.end] gives the particular invalid input that the codec failed on.

encoding

The name of the encoding that raised the error.

reason

A string describing the specific codec error.

object

The object the codec was attempting to encode or decode.

start

The first index of invalid data in object.

end

The index after the last invalid data in object.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM