Decode non-ascii characters in an ascii file?
I'm parsing a file which is in ascii format but includes non-ascii characters in big5 (Traditional Chinese).
Specifically, it is a CWR file from CISAC.
I'm trying to decode the non-ascii characters, without success. Here is an example line:
NWN000003930000016400507347 ^N&ÊÅ+/{^O
Positions 29 to 188 should be encoded in big5.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import sys
import binascii
from chardet.universaldetector import UniversalDetector
from chardet import detect

with open("/path/to/file") as fd:
    line = fd.readline()
    while line:
        if line[0:3] == 'NWN':
            last_name = line[29:188]
            print last_name
            print detect(line)['encoding']
            print last_name.decode('big5')
        line = fd.readline()
However, the result I get for the row above is:
None
&岒+/{
And for the following row:
NWN000000140000016300401453 ^N/õ<Dï.^O
it even crashes:
windows-1252
Traceback (most recent call last):
  File "test_big5.py", line 36, in <module>
    print last_name.decode('big5')
UnicodeDecodeError: 'big5' codec can't decode bytes in position 1-2: illegal multibyte sequence
I also tried the following:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
from codecs import EncodedFile
from_encoding = 'big5'
to_encoding = 'utf8'
sys.stdout = EncodedFile(sys.stdout, from_encoding, to_encoding)
f = file("/path/to/file", "r")
str = f.read()
sys.stdout.write(str)
I attach a sample file here.
Any idea what I'm doing wrong?
You should be able to read the file with the big5 codec. When trying it, I got:
>>> import codecs
>>> codecs.open('nwn.file', encoding="big5").read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/codecs.py", line 668, in read
    return self.reader.read(size)
UnicodeDecodeError: 'big5' codec can't decode bytes in position 1790-1791: illegal multibyte sequence
The lines in your file are pretty long, so I read them into a list (no codecs, just open the file in "rb" mode and call readlines()) and trimmed out whitespace. Now I can use this list as a runnable example. This is what I was getting at when I suggested you post data from the file read in binary mode.
test = [
    b'NWN000003930000016400507347 \x0e&\xca\xc5+/{\x0f ZH\r\n',
    b'NWN000003960000016400507347 \x0e&\xca\xc5+/{\x0f ZH\r\n',
    b'NWN000005660000046800507347 \x0e&\xca\xc5+/{\x0f ZH\r\n',
    b'NWN000016200000016400507347 \x0e&\xca\xc5+/{\x0f ZH\r\n',
    b'NWN000025600000016400507347 \x0e&\xca\xc5+/{\x0f ZH\r\n',
    b'NWN000000140000016300401453 \x0e/\xf5<D\xef.\x0f ZH\r\n',
]
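A list like the one above could be produced roughly as follows. This is only a sketch: the helper name read_trimmed_lines is made up for illustration, and collapsing runs of spaces is my guess at what "trimmed out whitespace" amounts to. Note that collapsing spaces shifts the fixed-width column positions, so it is only suitable for building a short example, not for real parsing.

```python
import re


def read_trimmed_lines(path):
    """Read a file as raw bytes and collapse runs of spaces in each line.

    Hypothetical helper: no codec is involved, so nothing can fail to decode.
    """
    with open(path, "rb") as fd:
        # readlines() keeps the line terminators (here \r\n) on each item.
        return [re.sub(b" {2,}", b" ", line) for line in fd.readlines()]
```

Printing repr() of each returned item gives byte literals that can be pasted into a question or answer.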
Then I did the decode line by line. Instead of the default errors='strict', I used errors='replace' to see what's going on. Those &岒+/{ are a bit odd, but then I don't know what this file is. Notice the question marks in the final line: there are non-big5 sequences. This file is corrupt.
>>> for line in test:
...     print line.strip().decode('big5', errors='replace')
...
NWN000003930000016400507347 &岒+/{ ZH
NWN000003960000016400507347 &岒+/{ ZH
NWN000005660000046800507347 &岒+/{ ZH
NWN000016200000016400507347 &岒+/{ ZH
NWN000025600000016400507347 &岒+/{ ZH
NWN000000140000016300401453 /�D� ZH
If you want most of the data, you could decode line by line like my example and catch that error.
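A minimal sketch of that idea, assuming the lines were read in binary mode as above. The function name decode_records and the choice to collect failures in a second list are mine, not anything from the CWR format. The code is written so that it runs under both Python 2.7 and Python 3.

```python
def decode_records(raw_lines, encoding="big5"):
    """Decode each raw byte line on its own; collect failures instead of aborting."""
    decoded = []
    failed = []
    for number, raw in enumerate(raw_lines, 1):
        try:
            # Strict decoding: raises on any illegal multibyte sequence.
            decoded.append(raw.strip().decode(encoding))
        except UnicodeDecodeError as exc:
            # Keep the line number, the raw bytes and the reason for later inspection.
            failed.append((number, raw, exc.reason))
    return decoded, failed


# Two of the sample lines from above: the first decodes, the second does not.
sample = [
    b'NWN000003930000016400507347 \x0e&\xca\xc5+/{\x0f ZH\r\n',
    b'NWN000000140000016300401453 \x0e/\xf5<D\xef.\x0f ZH\r\n',
]
decoded, failed = decode_records(sample)
```

After this runs, decoded holds the lines that are valid big5 and failed holds the ones that need a closer look, so one bad record no longer stops the whole file from being read.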