
How to read the ® character from a Windows-1252 file and write it to a UTF-8 file

I have an input file in Windows-1252 encoding that contains the '®' character. I need to write this character to a UTF-8 file. Also assume I must use Python 2.7. It seems easy enough, but I keep getting UnicodeDecodeErrors.

I originally opened the file using codecs.open() with UTF-8 encoding, which worked fine for all of the ASCII characters until it reached the ® symbol, whereupon it choked with the error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 2867043: 
invalid start byte

I knew that I would have to decode it as cp1252 to fix this problem, so I opened it in the proper encoding and then encoded the data as UTF-8 prior to writing. But that produced a new error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 22: 
ordinal not in range(128)

Here is a minimal working example:

import codecs

with codecs.open('in.txt', mode='rb', encoding='cp1252') as inf:
    with codecs.open('out.txt', mode='wb', encoding='utf-8') as of:
        for line in inf:
            of.write(line.encode('utf-8'))

Here are the contents of in.txt:

Sample file

Here is my sample file® yay.

I thought perhaps I could just open it in 'rb' mode with no encoding specified and handle the decoding and encoding of each line explicitly, like so:

of.write(line.decode('cp1252').encode('utf-8'))

But that also didn't work, giving the same error as when I just opened it as UTF-8.

How do I read data from a Windows-1252 file, properly decode it, then encode it as UTF-8 and write it to a UTF-8 file? The above method has always worked for me in the past, until I encountered the ® character.

Your file is not in Windows-1252 if 0xC2 should represent the ® character; in Windows-1252, 0xC2 is Â, and ® is 0xAE.
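To see where each byte in the two error messages comes from, it can help to inspect the bytes directly. The following is a small sketch, not part of the original post, that should run on both Python 2.7 and Python 3; the variable names are purely illustrative.

```python
# The registered sign as a single Windows-1252 byte.
cp1252_bytes = b'\xae'

# Decoding with the right codec gives the Unicode character U+00AE.
text = cp1252_bytes.decode('cp1252')

# Encoding that character as UTF-8 produces two bytes: 0xC2 0xAE.
utf8_bytes = text.encode('utf-8')

# A lone 0xC2 byte read as Windows-1252 is a different character (U+00C2).
a_circumflex = b'\xc2'.decode('cp1252')

# A lone 0xAE byte is not valid UTF-8, which matches the first error
# reported in the question.
utf8_decode_failed = False
try:
    cp1252_bytes.decode('utf-8')
except UnicodeDecodeError:
    utf8_decode_failed = True
```

So the 0xC2 in the second error is most likely the first byte of the UTF-8 form of ®, not a byte that was present in the input file.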

However, you should just use

of.write(line)

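since encoding properly is the whole reason you're using codecs in the first place. The file object returned by codecs.open() encodes on write and therefore expects unicode text. When it is handed a byte string that is already UTF-8, Python 2 first tries to decode it implicitly as ASCII, and that is what fails on 0xC2, the first byte of ® in UTF-8.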
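Putting that suggestion together, a complete round trip might look like the sketch below. The helper name convert_cp1252_to_utf8 and the temporary file paths are assumptions made for illustration; only codecs.open() and the write(line) call come from the thread. It is written to run on Python 2.7 and should also run on Python 3.

```python
import codecs
import os
import tempfile


def convert_cp1252_to_utf8(src_path, dst_path):
    """Copy a Windows-1252 text file to a UTF-8 text file, line by line.

    Hypothetical helper for illustration; not from the original post.
    """
    with codecs.open(src_path, mode='rb', encoding='cp1252') as inf:
        with codecs.open(dst_path, mode='wb', encoding='utf-8') as outf:
            for line in inf:
                # line is already decoded text; the writer encodes it as UTF-8,
                # so no explicit .encode() call is needed here.
                outf.write(line)


# Build a small Windows-1252 sample resembling in.txt from the question.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'in.txt')
dst = os.path.join(tmp_dir, 'out.txt')
with open(src, 'wb') as f:
    f.write(b'Sample file\n\nHere is my sample file\xae yay.\n')

convert_cp1252_to_utf8(src, dst)

# Read the result back as raw bytes to check the encoding on disk.
with open(dst, 'rb') as f:
    result = f.read()
```

The single 0xAE byte in the input ends up as the two bytes 0xC2 0xAE in the output. io.open() with an encoding argument is another option available on Python 2.7, though it also translates newlines in text mode.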


 