How to read ® character from Windows-1252 file and write to UTF-8 file
I have an input file in Windows-1252 encoding that contains the '®' character. I need to write this character to a UTF-8 file. Also assume I must use Python 2.7. Seems easy enough, but I keep getting UnicodeDecodeErrors.
I originally had just opened the original file using codecs.open() with UTF-8 encoding, which worked fine for all of the ASCII characters until it encountered the ® symbol, whereupon it choked with the error:
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 2867043: invalid start byte
I knew that I would have to properly decode it as cp1252 to fix this problem, so I opened it in the proper encoding and then encoded the data as UTF-8 prior to writing. But that produced a new error:
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 22: ordinal not in range(128)
Here is a minimal working example:
    import codecs

    with codecs.open('in.txt', mode='rb', encoding='cp1252') as inf:
        with codecs.open('out.txt', mode='wb', encoding='utf-8') as of:
            for line in inf:
                of.write(line.encode('utf-8'))  # raises the 'ascii' UnicodeDecodeError
Here are the contents of in.txt:
Sample file
Here is my sample file® yay.
I thought perhaps I could just open it in 'rb' mode with no encoding specified and specifically handle the decoding and encoding of each line like so:

    of.write(line.decode('cp1252').encode('utf-8'))
But that also didn't work, giving the same error as when I just opened it as UTF-8.
How do I read data from a Windows-1252 file, properly decode it, then encode it as UTF-8 and write it to a UTF-8 file? The above method has always worked for me in the past until I encountered the ® character.
Your file is not in Windows-1252 if 0xC2 should represent the ® character; in Windows-1252, 0xC2 is Â.
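To check the byte values involved, here is a small sketch. The variable names are only illustrative, and the literals are written so that it runs on both Python 2.7 and Python 3.3+:

```python
# Inspect how the registered sign relates to the bytes in the two error messages.
reg = u'\u00ae'  # the registered sign (U+00AE) as a unicode object

# In Windows-1252 the registered sign is the single byte 0xAE ...
cp1252_bytes = reg.encode('cp1252')
# ... while in UTF-8 it is the two-byte sequence 0xC2 0xAE.
utf8_bytes = reg.encode('utf-8')

# Byte 0xC2 on its own decodes to A-circumflex (U+00C2) in Windows-1252.
a_circumflex = b'\xc2'.decode('cp1252')

# A lone 0xAE is not valid UTF-8, which matches the "invalid start byte" error.
try:
    b'\xae'.decode('utf-8')
    lone_ae_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_ae_is_valid_utf8 = False
```

So a 0xAE byte in the input is consistent with a Windows-1252 file, and a 0xC2 byte is what you would expect to see once ® has already been encoded as UTF-8.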
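However, you should just use

    of.write(line)

since encoding properly is the whole reason you're using codecs in the first place. The file object returned by codecs.open() expects unicode objects and encodes them itself; if you hand it a byte string that is already encoded, Python 2 first tries to decode it implicitly as ASCII, which is where the 'ascii' codec error on byte 0xc2 comes from.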
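Putting that together, here is a minimal sketch of the whole conversion. It assumes the input really is Windows-1252. The function name `convert_cp1252_to_utf8` is made up for illustration, and `io.open` is swapped in for `codecs.open` because it behaves the same way on Python 2.7 and Python 3:

```python
import io


def convert_cp1252_to_utf8(src_path, dst_path):
    """Copy a text file, decoding Windows-1252 and re-encoding as UTF-8."""
    # newline='' turns off newline translation, so line endings pass through unchanged.
    with io.open(src_path, mode='r', encoding='cp1252', newline='') as inf:
        with io.open(dst_path, mode='w', encoding='utf-8', newline='') as outf:
            for line in inf:
                # `line` is already a unicode object; the file object
                # encodes it as UTF-8, so no explicit .encode() is needed.
                outf.write(line)
```

Calling `convert_cp1252_to_utf8('in.txt', 'out.txt')` on the sample file should give an out.txt in which ® is stored as the two bytes 0xC2 0xAE.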