UnicodeEncodeError when writing to file

I have a Python script that works great on my local machine (OS X), but when I copied it to a server (Debian), it does not work as expected. The script reads an XML file and prints the contents in a new format. On my local machine, I can run the script with stdout going to the terminal or to a file (i.e. > myFile.txt), and both work fine.

However, on the server (over ssh), when I print to the terminal everything works fine, but printing to a file (which is what I really need) gives: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128). All files are in UTF-8 encoding, and utf-8 is declared in the magic comment.

If I print the str objects inside a list (which is a trick I usually use to get a handle on encoding issues), it also throws the same error.

If I use print( x.encode('utf-8') ), then it prints code-style bytes literals (e.g. b'1' b'\xd0\x9a\xd0\xb0\xd0\xbc\xd0\xb0').

If I run $ export PYTHONIOENCODING=utf-8 in the shell (as suggested in some SO posts), then I get a binary file: 1 <D0><9A><D0><B0><D0><BC><D0><B0>.

I have checked all of the locale variables and the relevant ones match what I have on my local machine.

I can simply process the file locally and upload it, but I really want to understand what is happening here. Since the Python code works on one computer, I am not sure that it is relevant, but I am adding it below:

# -*- encoding: utf-8 -*-
import sys, xml.etree.ElementTree as ET

corpus = ET.parse('file.xml')
text = corpus.getroot()
for body in text :
  for sent in body :
    depDOMs = [(0,'') for i in range(len(sent)+1)]
    for word in sent :
      if word.tag == 'LF' :
        pass
      elif 'ID' in word.attrib and 'FEAT' in word.attrib and 'DOM' in word.attrib :
        ID = word.attrib['ID']
        try :
          Form =  word.text.replace(' ','_')
        except AttributeError :
          Form = '_'
        try :
          Lemma =  word.attrib['LEMMA'].replace(' ', '_')
        except KeyError :
          Lemma = '*NULL*'
        CPOS = word.attrib['FEAT'].split()[0]
        POS = word.attrib['FEAT'].replace( ' ' , '_' )
        Feats = '_'
        Head = word.attrib['DOM']
        if Head == '_root' :
          Head = '0'
        try :
          DepRel = word.attrib['LINK']
        except KeyError :
          DepRel = 'ROOT'
        PHead = '_'
        PDepRel = '_'
        try:
          if word.attrib['NODETYPE'] == 'FANTOM' :
            word.attrib['LEMMA'] = '*'+word.attrib['LEMMA']+'*'
        except KeyError :
          pass
        print( ID , Form , Lemma , Feats, CPOS , POS , Head , DepRel , PHead , PDepRel , sep='\t' )
      else :
        print( 'WARNING: what is this?',sent.attrib['ID'],word.attrib)
  print()

The underlying issue may be caused by a misconfiguration of Linux's locales, meaning that Python is being too cautious (falling back to the 'ascii' codec) when printing non-ASCII characters.
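One way to check whether this is what is happening is to ask Python which encoding it picked for stdout. The sketch below is only a diagnostic; `describe_stdout` is a hypothetical helper name, not part of any library. It reports to stderr so the result stays visible even when stdout is redirected to a file.

```python
import locale
import sys


def describe_stdout():
    """Collect the settings Python consults when encoding printed text."""
    return {
        # Encoding print() will use; may be None for some stream objects.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # False when output is redirected to a file or a pipe.
        "stdout_isatty": sys.stdout.isatty(),
        # Encoding derived from the locale settings.
        "preferred_encoding": locale.getpreferredencoding(False),
    }


# Write to stderr so the report still shows up with "> myFile.txt".
for key, value in describe_stdout().items():
    print(key, value, sep="\t", file=sys.stderr)
```

Running it once as `python3 check.py` and once as `python3 check.py > myFile.txt` (check.py is a placeholder name) shows whether the encoding changes when stdout is redirected. A value such as ascii or ANSI_X3.4-1968 would be consistent with the error in the question.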

Confirm the locale configuration by running locale. If there's a problem, you'll see something like:

$ locale 
locale: Cannot set LC_CTYPE to default locale: No such file or directory 
locale: Cannot set LC_ALL to default locale: No such file or directory 
LANG=en_US.UTF-8 
LANGUAGE= 

Fix this with:

$ sudo locale-gen "en_US.UTF-8"

(Replace "en_US.UTF-8" with the locale that's not working.) For further info, see: https://askubuntu.com/questions/162391/how-do-i-fix-my-locale-issue
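If the locale cannot be changed on the server, a possible workaround inside the script is to stop relying on the locale and encode the output as UTF-8 explicitly. This is a minimal sketch, not part of the fix above: `make_utf8_writer` is a hypothetical helper, and the in-memory buffer only stands in for `sys.stdout.buffer` so that the example runs anywhere.

```python
import io
import sys


def make_utf8_writer(binary_stream=None):
    """Wrap a binary stream in a text layer that always encodes as UTF-8."""
    if binary_stream is None:
        # In a real script this is the byte-level stream behind sys.stdout.
        binary_stream = sys.stdout.buffer
    # newline="\n" keeps line endings identical on every platform.
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")


# Demonstration against an in-memory buffer instead of the real stdout.
buffer = io.BytesIO()
out = make_utf8_writer(buffer)
print("1", "Кама", sep="\t", file=out)
out.flush()
written = buffer.getvalue()
```

In the script from the question, each `print(...)` call would take `file=out`, with `out = make_utf8_writer()` created once at the top. On Python 3.7 or later, `sys.stdout.reconfigure(encoding="utf-8")` should achieve the same without a wrapper. Note that the bytes written here, D0 9A D0 B0 D0 BC D0 B0, are the UTF-8 encoding of Кама, the same sequence shown in the question's PYTHONIOENCODING output, which suggests that file was valid UTF-8 being displayed by a tool that did not decode it.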

You can find important information about the error you are experiencing in the attributes of the UnicodeError-based exception.

Quoting the documentation:

UnicodeError has attributes that describe the encoding or decoding error. For example, err.object[err.start:err.end] gives the particular invalid input that the codec failed on.

encoding

The name of the encoding that raised the error.

reason

A string describing the specific codec error.

object

The object the codec was attempting to encode or decode.

start

The first index of invalid data in object.

end

The index after the last invalid data in object.
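As an illustration of these attributes, the sketch below provokes the same kind of failure as in the question by encoding a Cyrillic word with the ascii codec. `describe_encode_failure` is a hypothetical helper written for this example only.

```python
def describe_encode_failure(text, encoding="ascii"):
    """Return the UnicodeEncodeError attributes for text, or None if it encodes."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as err:
        return {
            "encoding": err.encoding,
            "reason": err.reason,
            "start": err.start,
            "end": err.end,
            # The exact characters the codec could not encode.
            "offending": err.object[err.start:err.end],
        }
    return None


info = describe_encode_failure("Кама")
# info["start"] is 0 and info["end"] is 4, which corresponds to
# "position 0-3" in the error message from the question.
```

Wrapping the failing print call in a similar try/except and writing `err.object[err.start:err.end]` with `repr()` to stderr would show which characters the codec rejected.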
