Python3: Convert Latin-1 to UTF-8
My code looks like the following:
import codecs
import glob
import os

for file in glob.iglob(os.path.join(dir, '*.txt')):
    print(file)
    with codecs.open(file, encoding='latin-1') as f:
        infile = f.read()
    with codecs.open('test.txt', mode='w', encoding='utf-8') as f:
        f.write(infile)
The files I work with are encoded in Latin-1 (they obviously fail to open as UTF-8), but I want to write the resulting files in UTF-8.
But this:
<Trans audio_filename="VALE_M11_070.MP3" xml:lang="español">
<Datos clave_texto=" VALE_M11_070" tipo_texto="entrevista_semidirigida">
<Corpus corpus="PRESEEA" subcorpus="ESESUMA" ciudad="Valencia" pais="España"/>
Instead becomes this (in gedit):
<Trans audio_filename="VALE_M11_070.MP3" xml:lang="espa뇃漀氀∀㸀ഀ㰀䐀愀琀`漀猀 挀氀愀瘀攀开琀攀砀琀漀㴀∀ 嘀䄀䰀䔀开䴀开 㜀
If I print it in the terminal, it displays normally.
Even more confusing is what I get when I open the resulting file with LibreOffice Writer:
<#T#r#a#n#s# (and so on)
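The `<#T#r#a#n#s#` pattern is what one would expect if every character were followed by a NUL byte, which is typical of UTF-16 text rather than Latin-1. That is only a guess from the symptoms shown. A minimal sketch for checking it follows, where `guess_utf16` is a hypothetical helper name and the 30% threshold is an arbitrary assumption:

```python
import codecs

def guess_utf16(raw: bytes) -> bool:
    """Heuristically report whether raw bytes look like UTF-16 text."""
    # A UTF-16 byte order mark is the strongest hint.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return True
    if not raw:
        return False
    # Mostly-ASCII UTF-16 text has a NUL in roughly every other byte.
    # The 0.3 cutoff is an arbitrary assumption, not a standard value.
    return raw.count(b"\x00") / len(raw) > 0.3

# Made-up sample data, encoded two different ways.
sample_utf16 = '<Trans xml:lang="espa\u00f1ol">'.encode("utf-16")
sample_latin1 = '<Trans xml:lang="espa\u00f1ol">'.encode("latin-1")
print(guess_utf16(sample_utf16))
print(guess_utf16(sample_latin1))
```

If the check reports UTF-16 for the real files (read in binary mode with `open(path, "rb")`), opening them with `encoding='utf-16'` instead of `'latin-1'` would be the thing to try.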
So how do I properly convert a Latin-1 string to a UTF-8 string? In Python 2 it's easy, but in Python 3 it seems confusing to me.
I have already tried these in different combinations:
#infile = bytes(infile,'utf-8').decode('utf-8')
#infile = infile.encode('utf-8').decode('utf-8')
#infile = bytes(infile,'utf-8').decode('utf-8')
But somehow I always end up with the same weird output.
Thanks in advance!
Edit: This question is different from the questions linked in the comment, as it concerns Python 3, not Python 2.7.
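One likely reason these attempts change nothing: in Python 3, `infile` is already a `str`, so encoding it to UTF-8 and decoding with the same codec returns an equal string. A small illustration, using a made-up sample string:

```python
text = 'pais="Espa\u00f1a"'

# Encoding to UTF-8 produces bytes; decoding those bytes as UTF-8
# restores a str equal to the one we started with.
round_trip = text.encode("utf-8").decode("utf-8")
also_round_trip = bytes(text, "utf-8").decode("utf-8")

print(round_trip == text)       # True
print(also_round_trip == text)  # True
```

The encoding that matters is the one chosen when bytes are read from or written to the file, not a conversion applied to the string in between.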
I have found a partial solution to this. It is not what you want or need, but it might point others in the right direction...
# First read the file
txt = open("file_name", "r", encoding="latin-1")  # r = read, w = write, a = append
items = txt.readlines()
txt.close()
# and write the changes to file
output = open("file_name", "w", encoding="utf-8")
for string_fin in items:
    # "Ã©" is what a UTF-8 encoded "é" looks like when decoded as Latin-1
    if "Ã©" in string_fin:
        string_fin = string_fin.replace("Ã©", "é")
    if "Ã«" in string_fin:
        string_fin = string_fin.replace("Ã«", "ë")
    # this works if not too much needs changing...
    output.write(string_fin)
output.close()
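For converting whole files rather than patching individual characters, a more general sketch might look like this. `convert_file` and the file names are hypothetical, and it assumes the source encoding is known in advance:

```python
import os
import tempfile

def convert_file(src, dst, src_encoding="latin-1", dst_encoding="utf-8"):
    """Read src as text in src_encoding and write it to dst in dst_encoding."""
    # newline="" stops Python from translating line endings on the way through.
    with open(src, "r", encoding=src_encoding, newline="") as f:
        text = f.read()
    with open(dst, "w", encoding=dst_encoding, newline="") as f:
        f.write(text)

# Small demonstration in a temporary directory, with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.txt")
    dst = os.path.join(tmp, "output.txt")
    with open(src, "wb") as f:
        f.write('pais="Espa\u00f1a"\n'.encode("latin-1"))
    convert_file(src, dst)
    with open(dst, "rb") as f:
        converted = f.read()

print(converted)
```

Writing to a separate destination file keeps the original intact in case the assumed source encoding turns out to be wrong.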
For Python 3.6:
your_str = your_str.encode('utf-8').decode('latin-1')
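The order of the two calls matters, and which order is needed depends on how the string was produced. A short illustration with a made-up sample; which case applies to the files in the question is not something this sketch can settle:

```python
original = "Espa\u00f1a"

# Encoding as UTF-8 and decoding as Latin-1 reinterprets each UTF-8 byte
# as its own character, so a non-ASCII character comes out as two characters.
reinterpreted = original.encode("utf-8").decode("latin-1")
print(len(original), len(reinterpreted))  # 6 7

# The opposite order undoes that reinterpretation.
restored = reinterpreted.encode("latin-1").decode("utf-8")
print(restored == original)  # True
```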