
Python: how to convert from Windows 1251 to Unicode?

I'm trying to convert a file's contents from Windows-1251 (Cyrillic) to Unicode with Python. I found this function, but it doesn't work.

#!/usr/bin/env python

import os
import sys
import shutil

def convert_to_utf8(filename):
    # gather the encodings you think the file may be
    # encoded in inside a tuple
    encodings = ('windows-1253', 'iso-8859-7', 'macgreek')

    # try to read the file as raw bytes and exit if an error occurs
    try:
        f = open(filename, 'rb').read()
    except Exception:
        sys.exit(1)

    # now start iterating over our encodings tuple and try to
    # decode the file
    for enc in encodings:
        try:
            # try to decode the file with the first encoding
            # from the tuple.
            # if it succeeds then it will reach break, so we
            # will be out of the loop (something we want on
            # success).
            # the data variable will hold our decoded text
            data = f.decode(enc)
            break
        except Exception:
            # if the first encoding fails, then the continue
            # keyword starts again with the second encoding
            # from the tuple and so on, until one succeeds.
            # if for some reason it reaches the last encoding of
            # our tuple without success, then exit the program.
            if enc == encodings[-1]:
                sys.exit(1)
            continue

    # now get the absolute path of our filename and append .bak
    # to the end of it (for our backup file)
    fpath = os.path.abspath(filename)
    newfilename = fpath + '.bak'
    # and make our backup file with shutil
    shutil.copy(filename, newfilename)

    # and at last convert it to utf-8 (written as raw bytes)
    f = open(filename, 'wb')
    try:
        f.write(data.encode('utf-8'))
    except Exception as e:
        print(e)
    finally:
        f.close()

How can I do this?

Thanks

import codecs

# filename and output are the paths of the input and output files
with codecs.open(filename, 'r', 'cp1251') as f:
    u = f.read()   # now the contents have been transformed to a Unicode string
with codecs.open(output, 'w', 'utf-8') as out:
    out.write(u)   # and now the contents have been output as UTF-8

Is that what you intend to do?
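
The same conversion can also be written at the byte level, which makes the two steps (decode from cp1251, then encode to UTF-8) explicit. This is only a minimal sketch assuming Python 3; the helper name cp1251_to_utf8 and the sample bytes are made up for illustration and are not part of the answer above.

```python
def cp1251_to_utf8(raw):
    """Decode Windows-1251 bytes to text, then encode that text as UTF-8."""
    text = raw.decode('cp1251')   # bytes -> Unicode string
    return text.encode('utf-8')   # Unicode string -> UTF-8 bytes

# The word "Привет" in Windows-1251: one byte per Cyrillic letter
sample = b'\xcf\xf0\xe8\xe2\xe5\xf2'
converted = cp1251_to_utf8(sample)
```

In UTF-8 each of these Cyrillic letters takes two bytes, so converted is twice as long as sample.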

If you open the file with the codecs module, it converts the contents to Unicode for you as you read the file. For example:

import codecs
f = codecs.open('input.txt', encoding='cp1251')
assert isinstance(f.read(), unicode)  # Python 2; on Python 3 the type is str

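This only makes sense if you're working with the file's data inside Python. If you're trying to convert a file from one encoding to another on the filesystem (which is what the script you posted tries to do), you have to specify an actual target encoding, because you can't write a file in "Unicode".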
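
An on-disk conversion along those lines might look like the sketch below. It assumes Python 3, where the built-in open accepts an encoding argument; the function name convert_file and its default encodings are hypothetical, and the .bak backup simply mirrors what the posted script does.

```python
import shutil

def convert_file(path, src_encoding='cp1251', dst_encoding='utf-8'):
    """Rewrite the file at `path` from src_encoding to dst_encoding in place."""
    # Read and decode first, so a decoding error leaves the file untouched.
    # newline='' keeps the original line endings as they are.
    with open(path, 'r', encoding=src_encoding, newline='') as f:
        text = f.read()
    # Keep a backup copy next to the original, as the posted script does.
    shutil.copy(path, path + '.bak')
    # Write the text back out in an actual, named target encoding.
    with open(path, 'w', encoding=dst_encoding, newline='') as f:
        f.write(text)
```

Note that cp1251 assigns a character to almost every byte value, so decoding rarely fails even when the source encoding is wrong; the caller has to know the real encoding of the file.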

This is just a guess, since you didn't say what you mean by "doesn't work".

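If the file is being generated correctly but appears to contain garbage characters, the application you're viewing it with probably doesn't recognize that it contains UTF-8. You need to add a BOM to the beginning of the file - the 3 bytes 0xEF,0xBB,0xBF (unencoded).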
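
One way to get those three bytes without writing them by hand is Python's utf-8-sig codec, which prepends the BOM when encoding and strips it when decoding. A minimal sketch, assuming Python 3; the sample text is made up for illustration:

```python
import codecs

text = '\u041f\u0440\u0438\u0432\u0435\u0442'  # the word "Привет"

# 'utf-8-sig' prepends the 3-byte BOM when encoding
with_bom = text.encode('utf-8-sig')
assert with_bom.startswith(codecs.BOM_UTF8)    # b'\xef\xbb\xbf'

# Decoding with 'utf-8-sig' strips the BOM again
assert with_bom.decode('utf-8-sig') == text
```

The same codec name can be passed as the encoding when opening the output file, so the BOM is written automatically at the start of the file.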
