
UnicodeError Reading Accented Portuguese Characters from File

Preface:

It's a cold, rainy day in mid-2016, and a developer is still having encoding issues with Python for not using Python 3.0. Will the great SO community help him? I don't know, we will have to wait and see.

Scope:

I have a UTF-8 encoded file that contains accented words, such as CURRÍCULO and NÓS. For some reason I can't grasp, I can't manage to read them properly using Python 2.7.

Code Snippet:

import codecs

keywords = []  # collects the upper-cased keywords read from the file
f_reader = codecs.open('PATH_TO_FILE/Data/Input/kw.txt', 'r', encoding='utf-8')

for line in f_reader:
    keywords.append(line.strip().upper())
    print line

The output I get is:

TRABALHE CONOSCO
ENVIE SEU CURRICULO
ENVIE SEU CURRÍCULO  
UnicodeEncodeError, 'ascii' codec can't encode character u'\xcd' in position 14: ordinal not in range(128)

Encoding, Encoding, Encoding:

I have used Notepad++ to convert the file to both regular UTF-8 and UTF-8 without the byte order mark, and it shows me the characters just fine, without any issue. I'm using Windows, by the way, which will create files as ANSI by default.

Question:

What should I do to be able to read this file properly, including the í and ó and other accented characters?

Just to make it clearer, I want to keep the accentuation on the strings I use in memory.

Update:

Here's the list of keywords, in memory, read from the file using the code you can see.

[Image: list of keywords read into memory]

The problem seems not to be in the reading, but in the printing.

You said:

I'm using Windows, by the way, which will create files as ANSI by default.

I think that includes printing to stdout. Try changing the sys.stdout codec:

sys.stdout = codecs.getwriter("utf-8")(sys.stdout)
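To see why this might help, here is a small sketch that reproduces the failure and the fix without touching the real console. It assumes the offending line is the one shown in the question's output, and it uses an in-memory io.BytesIO buffer as a hypothetical stand-in for a byte-oriented stdout; the names line, fake_stdout and writer are illustrative only.

```python
import codecs
import io

# The line from the question that triggers the error (\xcd is the accented I).
line = u"ENVIE SEU CURR\xcdCULO"

# Encoding to ASCII fails at the accented character, as in the reported error.
try:
    line.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# A UTF-8 stream writer encodes the same text without complaint.
# io.BytesIO stands in for a byte-oriented stdout (an assumption of this demo).
fake_stdout = io.BytesIO()
writer = codecs.getwriter("utf-8")(fake_stdout)
writer.write(line)
encoded = fake_stdout.getvalue()
```

The failing position reported by the exception matches the one in the question (14). Note that wrapping stdout only changes which bytes are written; whether the Windows console then displays them correctly likely depends on its active code page.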
