
Fastest way to convert file from latin1 to utf-8 in python

I need the fastest way to convert files from latin1 to utf-8 in Python. The files are large, ~2 GB (I am moving DB data). So far I have:

import codecs
infile = codecs.open(tmpfile, 'r', encoding='latin1')
outfile = codecs.open(tmpfile1, 'w', encoding='utf-8')
for line in infile:
    outfile.write(line)
infile.close()
outfile.close()

but it is still slow. The conversion takes one fourth of the whole migration time.

I could also use a Linux command-line utility if it is faster than native Python code.

I would go with iconv and a system call.
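A minimal sketch of that approach from Python (the helper name is hypothetical, and the file names `tmpfile` / `tmpfile1` are assumed from the question; GNU iconv writes to stdout, so its output is redirected into the destination file):

```python
import subprocess

def iconv_convert(src, dst):
    """Convert src from latin1 to utf-8 by shelling out to iconv."""
    with open(dst, 'wb') as out:
        subprocess.run(['iconv', '-f', 'LATIN1', '-t', 'UTF-8', src],
                       stdout=out, check=True)
```

Usage would then be `iconv_convert(tmpfile, tmpfile1)`; `check=True` raises if iconv hits an error instead of silently producing a truncated file.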

You could use blocks larger than one line, and do binary I/O; each might speed things up a bit (though on Linux binary I/O won't, as it's identical to text I/O):

 BLOCKSIZE = 1024 * 1024
 with open(tmpfile, 'rb') as inf:
     with open(tmpfile1, 'wb') as ouf:
         while True:
             data = inf.read(BLOCKSIZE)
             if not data:
                 break
             converted = data.decode('latin1').encode('utf-8')
             ouf.write(converted)

The byte-by-byte parsing implied by line-by-line reading, line-end conversion (not on Linux ;-), and codecs.open-style encoding-decoding, should be part of what's slowing you down. This approach is also portable (like yours is), since control characters such as \n need no translation between these codecs anyway (in any OS).

This only works for input codecs that have no multibyte characters, but `latin1` is one of those (it does not matter whether the output codec has such characters or not).
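To see why the single-byte input codec matters: with a multibyte codec, a character can straddle a block boundary, so decoding each block independently can fail; with latin1 every byte is a complete character, so any split is safe. A small demo splitting one UTF-8 character in half:

```python
data = 'é'.encode('utf-8')          # two bytes: b'\xc3\xa9'
first, second = data[:1], data[1:]  # split in the middle of the character

# Every byte is a complete latin1 character, so block boundaries are safe:
first.decode('latin1')

# But the same leading byte is an incomplete utf-8 sequence:
try:
    first.decode('utf-8')
except UnicodeDecodeError:
    print('incomplete multibyte sequence at block boundary')
```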

Try different block sizes to find the performance sweet spot, depending on your disk, filesystem and available RAM.
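One way to run that experiment (a sketch: `time_convert` is a hypothetical helper wrapping the block-copy loop above, and the file names are assumed from the question):

```python
import time

def time_convert(src, dst, blocksize):
    """One latin1 -> utf-8 pass with the given block size; returns seconds."""
    start = time.perf_counter()
    with open(src, 'rb') as inf, open(dst, 'wb') as ouf:
        while True:
            data = inf.read(blocksize)
            if not data:
                break
            ouf.write(data.decode('latin1').encode('utf-8'))
    return time.perf_counter() - start
```

For example, time 64 KiB, 1 MiB and 8 MiB blocks and keep the fastest: `for size in (64 * 1024, 1024 * 1024, 8 * 1024 * 1024): print(size, time_convert(tmpfile, tmpfile1, size))`.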

Edit: changed code per @John's comment, and clarified a condition as per @gnibbler's.

If you are desperate to do it in Python (or any other language), at least do the I/O in bigger chunks than lines, and avoid the codecs overhead.

infile = open(tmpfile, 'rb')
outfile = open(tmpfile1, 'wb')
BLOCKSIZE = 65536 # experiment with size
while True:
    block = infile.read(BLOCKSIZE)
    if not block: break
    outfile.write(block.decode('latin1').encode('utf8'))
infile.close()
outfile.close()

Otherwise, go with iconv ... I haven't looked under the hood, but if it doesn't special-case latin1 input I'd be surprised :-)
