
How to convert encoding of text file (which contains text of language other than English) from "UTF-16 LE" to "UTF-8" in Python?

I have a few text files in a folder that contain text in Hindi. The files are encoded in UTF-16 LE, and I want to change the encoding to UTF-8 without changing the text in them. How can I do that?

I wrote two Python scripts, but neither of them works properly. When I run either one, it changes the encoding but also clears the file content. This is the code in my Python files:

File 1:

import os
for root, dirs, files in os.walk("."):  
    for filename in files:
        #print(filename[-4:])
        if(filename[-3:] == "txt"):
            f= open(filename,"w+")
            x = f.read()
            print(x)
            f.close()
            f1= open(filename, "w+", encoding="utf-8")
            f1.write(x)
            f1.close()

File 2:

import codecs
BLOCKSIZE = 1048576
with codecs.open("ee.txt", "r", "utf-16-le") as sourceFile:
    with codecs.open("ee.txt", "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            print(contents)
            if not contents:
                break
            targetFile.write(contents)

You are not specifying that the files are in UTF-16 LE when reading their contents. On top of that, the code tries to read and write the same file at the same time, which won't work.

Also, unless you are running this code on a server where someone could attack it by sending an inordinately big text file, you should not worry about file size; just read each file's contents at once. (To give you an idea: the Bible, which is a big book, is on the order of 3 MB in size with an 8-bit encoding, and even small VPS servers will have on the order of 200 MB of memory available to your program. That is, you could convert a book the size of 30+ Bibles at once.) Typical desktop computers have several times this amount of memory.

Also, the relatively recent "pathlib" Python library makes it easy to iterate through all your text files, and its Path.read_text and Path.write_text methods each open a file, read or write the contents in the given encoding, and close it, all in a single expression. Since with these methods the reading is already finished by the time the file is written, we can simply do it with two calls:

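import pathlib
for filepath in pathlib.Path(".").glob("**/*.txt"):
    # Read the whole file, decoding it from UTF-16 LE
    data = filepath.read_text(encoding="utf-16-le")
    # The read has finished, so the same file can now be overwritten as UTF-8
    filepath.write_text(data, encoding="utf-8")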
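One detail the two-call version does not cover, offered as a hedged aside: files saved as UTF-16 LE often start with a byte order mark (BOM). Decoding with "utf-16-le" keeps it as the character U+FEFF, so it would be carried over into the UTF-8 output. The sketch below drops it before writing. The helper name utf16le_to_utf8 and the keep_bom flag are hypothetical, and whether your files actually have a BOM is an assumption you would need to check.

```python
import pathlib

def utf16le_to_utf8(filepath, keep_bom=False):
    """Convert one file from UTF-16 LE to UTF-8 in place (hypothetical helper)."""
    data = filepath.read_text(encoding="utf-16-le")
    # "utf-16-le" does not strip a leading BOM; it shows up as U+FEFF
    if not keep_bom and data.startswith("\ufeff"):
        data = data[1:]
    filepath.write_text(data, encoding="utf-8")

# Possible use over a folder (commented out so nothing is converted by accident):
# for filepath in pathlib.Path(".").glob("**/*.txt"):
#     utf16le_to_utf8(filepath)
```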

If you prefer to be on the safe side, in the very, very unlikely event of a catastrophic computer crash in the middle of a file conversion, you could write to a differently named file and do the delete/rename afterwards, so the code is like this:

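import pathlib
for filepath in pathlib.Path(".").glob("**/*.txt"):
    data = filepath.read_text(encoding="utf-16-le")
    # Write to a temporary sibling file first, so a crash cannot leave a half-written original
    tmp_path = filepath.with_name(filepath.name + ".tmp")
    tmp_path.write_text(data, encoding="utf-8")
    # replace() overwrites the original in one step; passing the full path (not just
    # filepath.name) keeps files found in subfolders in their own folder
    tmp_path.replace(filepath)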

Before explaining what is wrong, two useful tips:

I think you should remove the print. It will just confuse you, because the encoding it prints in depends on the operating system and environment.

Try with a very short file (a few characters) and check the input and output of both of your scripts, both as text and as bytes.
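The second tip can be tried on a throwaway file. This is a minimal sketch, assuming a hypothetical file name sample.txt and the Hindi word "नमस्ते" as test content. It works in a temporary directory, so nothing in your own folder is touched.

```python
import pathlib
import tempfile

def inspect_conversion(text="नमस्ते"):
    """Write `text` as UTF-16 LE, convert it to UTF-8, and return both byte strings."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        sample = pathlib.Path(tmp_dir) / "sample.txt"  # hypothetical file name
        sample.write_bytes(text.encode("utf-16-le"))
        before = sample.read_bytes()

        # Read everything first, then write: the file is never open in both modes at once
        data = sample.read_text(encoding="utf-16-le")
        sample.write_text(data, encoding="utf-8")
        after = sample.read_bytes()
    return before, after

before, after = inspect_conversion()
print(len(before), before)  # raw UTF-16 LE bytes
print(len(after), after)    # raw UTF-8 bytes
# Compare the decoded text instead of relying on how the console renders Hindi
print(after.decode("utf-8") == before.decode("utf-16-le"))
```

Comparing bytes and decoded strings sidesteps the console-encoding problem mentioned in the first tip.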

Now the solution:

In the first example: the first open() call should open the file for reading ( r ). With w+ the file is truncated as soon as it is opened, so read() returns nothing.
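A sketch of how File 1 might look with that fix applied. It goes a little beyond the fix itself: passing the source encoding, joining root with filename, and newline="" (to leave line endings untouched) are my additions, and convert_tree is a hypothetical helper name.

```python
import os

def convert_tree(top="."):
    """Convert every .txt file under `top` from UTF-16 LE to UTF-8 in place."""
    converted = []
    for root, dirs, files in os.walk(top):
        for filename in files:
            if filename.endswith(".txt"):
                # os.walk yields bare file names, so build the full path
                path = os.path.join(root, filename)
                # "r" keeps the content; "w+" would truncate it before read()
                with open(path, "r", encoding="utf-16-le", newline="") as f:
                    x = f.read()
                # Only now, with the text in memory, reopen the file for writing
                with open(path, "w", encoding="utf-8", newline="") as f1:
                    f1.write(x)
                converted.append(path)
    return converted

# Possible use (commented out so nothing is converted by accident):
# convert_tree(".")
```

Running it a second time on already converted files would fail or garble them, since they are no longer UTF-16 LE.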

In the second example: you open the same file twice, first to read it, but before you have read anything you open it again to write. That truncates the file, so there are no characters left to read.

Write to an ee.txt.tmp file instead and, at the end, if there were no errors, move the tmp file into place, removing the .tmp suffix.
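Applied to File 2, that advice might look like the sketch below. The function name convert_file is hypothetical, and two things are swapped in by me: the built-in open() with newline="" in place of codecs.open, and os.replace for the final move.

```python
import os

BLOCKSIZE = 1048576  # same chunk size as in File 2

def convert_file(path, source_encoding="utf-16-le"):
    """Stream `path` into `path + '.tmp'` as UTF-8, then move the result over the original."""
    tmp_path = path + ".tmp"
    with open(path, "r", encoding=source_encoding, newline="") as source_file:
        with open(tmp_path, "w", encoding="utf-8", newline="") as target_file:
            while True:
                contents = source_file.read(BLOCKSIZE)
                if not contents:
                    break
                target_file.write(contents)
    # Reached only if nothing above raised; this drops the .tmp suffix
    os.replace(tmp_path, path)

# Possible use (commented out so nothing is converted by accident):
# convert_file("ee.txt")
```

If an error occurs midway, the original file is left untouched and only the .tmp file needs cleaning up.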

In general: never read from and write to the same file.
