
Fastest way to convert file from latin1 to utf-8 in python

I need the fastest way to convert files from latin1 to utf-8 in Python. The files are large, ~2 GB (I am moving DB data). So far I have:

import codecs
infile = codecs.open(tmpfile, 'r', encoding='latin1')
outfile = codecs.open(tmpfile1, 'w', encoding='utf-8')
for line in infile:
    outfile.write(line)
infile.close()
outfile.close()

but it is still slow. The conversion takes one fourth of the whole migration time.

I could also use a Linux command-line utility if it is faster than native Python code.

I would go with iconv and a system call.
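A minimal sketch of that approach from Python (the helper name is hypothetical, and the file names `tmpfile` / `tmpfile1` are assumed from the question; GNU iconv writes to stdout, so its output is redirected into the destination file):

```python
import subprocess

def iconv_convert(src, dst):
    """Convert src from latin1 to utf-8 by shelling out to iconv."""
    with open(dst, 'wb') as out:
        subprocess.run(['iconv', '-f', 'LATIN1', '-t', 'UTF-8', src],
                       stdout=out, check=True)
```

Usage would then be `iconv_convert(tmpfile, tmpfile1)`; `check=True` raises if iconv hits an error instead of silently producing a truncated file.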

You could use blocks larger than one line, and do binary I/O; each might speed things up a bit (though on Linux binary I/O won't, as it's identical to text I/O):

 BLOCKSIZE = 1024 * 1024
 with open(tmpfile, 'rb') as inf:
     with open(tmpfile1, 'wb') as ouf:
         while True:
             data = inf.read(BLOCKSIZE)
             if not data:
                 break
             converted = data.decode('latin1').encode('utf-8')
             ouf.write(converted)

The byte-by-byte parsing implied by line-by-line reading, line-end conversion (not on Linux ;-), and codecs.open-style encoding-decoding, should be part of what's slowing you down. This approach is also portable (like yours is), since control characters such as \n need no translation between these codecs anyway (in any OS).

This only works for input codecs that have no multibyte characters, but `latin1` is one of those (it does not matter whether the output codec has such characters or not).
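To see why the single-byte input codec matters: with a multibyte codec, a character can straddle a block boundary, so decoding each block independently can fail; with latin1 every byte is a complete character, so any split is safe. A small demo splitting one UTF-8 character in half:

```python
data = 'é'.encode('utf-8')          # two bytes: b'\xc3\xa9'
first, second = data[:1], data[1:]  # split in the middle of the character

# Every byte is a complete latin1 character, so block boundaries are safe:
first.decode('latin1')

# But the same leading byte is an incomplete utf-8 sequence:
try:
    first.decode('utf-8')
except UnicodeDecodeError:
    print('incomplete multibyte sequence at block boundary')
```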

Try different block sizes to find the performance sweet spot, depending on your disk, filesystem and available RAM.
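One way to run that experiment (a sketch: `time_convert` is a hypothetical helper wrapping the block-copy loop above, and the file names are assumed from the question):

```python
import time

def time_convert(src, dst, blocksize):
    """One latin1 -> utf-8 pass with the given block size; returns seconds."""
    start = time.perf_counter()
    with open(src, 'rb') as inf, open(dst, 'wb') as ouf:
        while True:
            data = inf.read(blocksize)
            if not data:
                break
            ouf.write(data.decode('latin1').encode('utf-8'))
    return time.perf_counter() - start
```

For example, time 64 KiB, 1 MiB and 8 MiB blocks and keep the fastest: `for size in (64 * 1024, 1024 * 1024, 8 * 1024 * 1024): print(size, time_convert(tmpfile, tmpfile1, size))`.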

Edit: changed code per @John's comment, and clarified a condition as per @gnibbler's.

If you are desperate to do it in Python (or any other language), at least do the I/O in bigger chunks than lines, and avoid the codecs overhead.

infile = open(tmpfile, 'rb')
outfile = open(tmpfile1, 'wb')
BLOCKSIZE = 65536 # experiment with size
while True:
    block = infile.read(BLOCKSIZE)
    if not block: break
    outfile.write(block.decode('latin1').encode('utf8'))
infile.close()
outfile.close()

Otherwise, go with iconv ... I haven't looked under the hood, but if it doesn't special-case latin1 input I'd be surprised :-)
