How to read the ® character from a Windows-1252 file and write it to a UTF-8 file

Question

I have an input file in Windows-1252 encoding that contains the '®' character. I need to write this character to a UTF-8 file. Also assume I must use Python 2.7. It seems easy enough, but I keep getting UnicodeDecodeError exceptions.

I had originally opened the input file using codecs.open() with UTF-8 encoding, which worked fine for all of the ASCII characters until it encountered the ® symbol, whereupon it choked with the error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 2867043: 
invalid start byte

I knew that I would have to decode it as cp1252 to fix this problem, so I opened it with that encoding and then encoded the data as UTF-8 prior to writing. But that produced a new error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 22: 
ordinal not in range(128)
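
For context, here is a minimal sketch of where the 0xc2 byte at position 22 plausibly comes from. It assumes the failing line is the sample line from in.txt: ® (U+00AE) becomes the two bytes 0xC2 0xAE in UTF-8, and under Python 2.7 a codecs writer that is handed an already-encoded byte string first tries to decode it as ASCII before encoding it again. The variable names are illustrative; the snippet only reproduces the byte-level facts and runs on Python 2.7 and Python 3.

```python
# The sample line from in.txt, as text (unicode), with U+00AE for the ® sign.
line = u'Here is my sample file\u00ae yay.'

# Encoding to UTF-8 turns ® into two bytes, the first of which is 0xC2.
registered_bytes = u'\u00ae'.encode('utf-8')
encoded = line.encode('utf-8')

# Decoding the UTF-8 bytes as ASCII fails on the 0xC2 byte. This mirrors
# the implicit ASCII decode that produces the second traceback.
try:
    encoded.decode('ascii')
    failing_position = None
except UnicodeDecodeError as exc:
    failing_position = exc.start  # index of the first non-ASCII byte

print(repr(registered_bytes), failing_position)
```

The reported position matches the index of ® in the sample line, which suggests the cp1252 decoding step worked and the failure comes from encoding the line a second time.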

Here is a minimal example that reproduces the problem:

import codecs

with codecs.open('in.txt', mode='rb', encoding='cp1252') as inf:
    with codecs.open('out.txt', mode='wb', encoding='utf-8') as of:
        for line in inf:
            # Raises the 'ascii' codec UnicodeDecodeError shown above.
            of.write(line.encode('utf-8'))

Here are the contents of in.txt:

Sample file

Here is my sample file® yay.

I thought perhaps I could just open it in 'rb' mode with no encoding specified and explicitly handle the decoding and encoding of each line, like so:

of.write(line.decode('cp1252').encode('utf-8'))

But that also didn't work, giving the same error as when I just opened it as UTF-8.

How do I read data from a Windows-1252 file, properly decode it, then encode it as UTF-8 and write it to a UTF-8 file? The above method had always worked for me until I encountered the ® character.

Answer 1

Your file is not in Windows-1252 if 0xC2 should represent the ® character; in Windows-1252, 0xC2 is Â.
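
One way to check which bytes the file actually uses for ® is to look at its raw contents. The sketch below is a rough heuristic, not a general encoding detector: the helper name registered_sign_encoding is hypothetical, and it only distinguishes the UTF-8 byte pair 0xC2 0xAE from the single Windows-1252 byte 0xAE. A Windows-1252 file that really contains the text "Â®" would be misreported as UTF-8.

```python
def registered_sign_encoding(raw):
    """Hypothetical helper: guess how the registered sign is stored in raw bytes.

    Returns 'utf-8' if the two-byte sequence 0xC2 0xAE is present,
    'cp1252' if only a lone 0xAE byte is present, and None otherwise.
    """
    if b'\xc2\xae' in raw:
        return 'utf-8'
    if b'\xae' in raw:
        return 'cp1252'
    return None


# In Windows-1252, byte 0xC2 is the letter A with circumflex (U+00C2),
# while the registered sign (U+00AE) is the single byte 0xAE.
assert b'\xc2'.decode('cp1252') == u'\u00c2'
assert u'\u00ae'.encode('cp1252') == b'\xae'

# Sample bytes standing in for the contents of a file opened with mode 'rb'.
print(registered_sign_encoding(b'Here is my sample file\xae yay.'))
```

With a real file, the bytes would come from reading it in binary mode and passing the result to the helper.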

However, you should just use

of.write(line)

since handling the encoding for you is the whole reason you're using codecs in the first place.
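
Putting this together, a complete conversion might look like the sketch below. It swaps codecs.open() for io.open(), which is available in Python 2.7 and Python 3 and behaves the same way for this purpose: the reader decodes cp1252 into text and the writer encodes text as UTF-8, so no explicit encode() call is needed. The function name convert_cp1252_to_utf8 and the temporary file names are assumptions made for the demonstration.

```python
import io
import os
import shutil
import tempfile


def convert_cp1252_to_utf8(src_path, dst_path):
    """Copy a Windows-1252 text file to a UTF-8 text file, line by line."""
    # newline='' disables newline translation so line endings are preserved.
    with io.open(src_path, 'r', encoding='cp1252', newline='') as inf:
        with io.open(dst_path, 'w', encoding='utf-8', newline='') as of:
            for line in inf:
                # line is already decoded text; the writer encodes it.
                of.write(line)


# Demonstration with a temporary copy of the sample file from the question.
workdir = tempfile.mkdtemp()
try:
    src = os.path.join(workdir, 'in.txt')
    dst = os.path.join(workdir, 'out.txt')
    with open(src, 'wb') as f:
        f.write(b'Sample file\n\nHere is my sample file\xae yay.\n')

    convert_cp1252_to_utf8(src, dst)

    with open(dst, 'rb') as f:
        result = f.read()
finally:
    shutil.rmtree(workdir)

print(repr(result))
```

The output file holds the same text, with the single byte 0xAE replaced by the UTF-8 pair 0xC2 0xAE.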

Solution 1
1 Accepted 2015-10-14 15:31:37
