How to read a large file with Unicode in Python 3

Question

Hello, I have a large file that contains Unicode characters, and when I try to open it in Python 3 this is the error I get:

    File "addRNC.py", line 47, in add_rnc()
    File "addRNC.py", line 13, in init
      for value in rawDoc.readline():
    File "/usr/local/lib/python3.1/codecs.py", line 300, in decode
      (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xd3 in position 158: invalid continuation byte

I tried everything I could think of and nothing worked. Here is the code:

    import io

    # `db` is the asker's already-open database connection; it is not defined in the post.
    rawDoc = io.open("/root/potential/rnc_lst.txt", 'r', encoding='utf8')
    result = []
    # Iterate over the file object to get one line at a time. Looping over
    # rawDoc.readline() would walk the characters of the first line only.
    for value in rawDoc:
        fields = value.split('|')  # split once and reuse the pieces

        if fields[9] == 'ACTIVO' and fields[10] == 'NORMAL':
            address = ''.join(fields[4:7])
            if fields[8] != '':
                # Turn DD/MM/YYYY into YYYY-MM-DD.
                rawdate = fields[8].split('/')
                _date = rawdate[2] + "-" + rawdate[1] + "-" + rawdate[0]
            else:
                _date = 'NULL'

            id = db.prepare("SELECT id FROM potentials_reg WHERE(rnc = '%s')" % (fields[0]))()

            # Insert only when no row with this rnc exists yet.
            if len(id) == 0:
                if _date == 'NULL':
                    db.prepare("INSERT INTO potentials_reg (rnc, _name, _owner, work_type, address, telephone, constitution, active)" +
                               "VALUES('%s', '%s', '%s', '%s', '%s', '%s', NULL, '%s')" % (fields[0], fields[1],
                                                                                        fields[2], fields[3], address,
                                                                                        fields[7], 'true'))()
                else:
                    db.prepare("INSERT INTO potentials_reg (rnc, _name, _owner, work_type, address, telephone, constitution, active)" +
                               "VALUES('%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s')" % (fields[0], fields[1],
                                                                                        fields[2], fields[3], address,
                                                                                        fields[7], _date, 'true'))()

    db.close()

Answer 1

Your file actually contains invalid UTF-8.

When you say the file "contains Unicode characters", be aware that Unicode doesn't specify how the characters are represented as bytes. So even if the file represents Unicode data, it could be in UTF-8, UTF-16 (UTF-16BE or UTF-16LE, each with or without a BOM), the deprecated UCS-2, or perhaps even one of the more esoteric forms...
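To make the point about representations concrete, here is a minimal sketch that encodes one string several ways. The sample word is hypothetical, not taken from your file; I picked it only because it contains an accented letter.

```python
# One string, several byte representations.
# Hypothetical sample value; "\u00d3" is the letter O with an acute accent.
text = "CONSTRUCCI\u00d3N"

encodings = ["utf-8", "utf-16-le", "utf-16-be", "latin-1"]
encoded = {name: text.encode(name) for name in encodings}

for name, data in encoded.items():
    print(name, len(data), data)

# Bytes written with one codec do not necessarily decode with another.
try:
    encoded["latin-1"].decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print("utf-8 decode failed:", exc.reason)
```

The same twelve characters come out as a different byte sequence for each codec, which is why the reader has to be told which encoding the writer used.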

Double-check that the file is valid. I'd bet that you do have a byte 0xD3 (11010011) there. In UTF-8 that value can only be the leading byte of a two-byte character, so it must be followed by a continuation byte (one whose binary representation begins with 10, that is, 0x80 to 0xBF). "Invalid continuation byte" means that rule is broken around position 158: a leading byte such as 0xD3 sits next to a byte that cannot pair with it.
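If the bit patterns are hard to picture, a rough sketch like the following classifies a byte by its high bits. The helper name `utf8_byte_role` is my own, and it looks at the bit pattern only; it does not check the other UTF-8 validity rules (overlong forms, values above 0xF4, and so on).

```python
def utf8_byte_role(byte):
    """Classify one byte value (0-255) by its high bits, following the UTF-8 layout."""
    if byte < 0x80:
        return "ascii"          # 0xxxxxxx
    if byte < 0xC0:
        return "continuation"   # 10xxxxxx
    if byte < 0xE0:
        return "lead-2"         # 110xxxxx, starts a two-byte character
    if byte < 0xF0:
        return "lead-3"         # 1110xxxx, starts a three-byte character
    if byte < 0xF8:
        return "lead-4"         # 11110xxx, starts a four-byte character
    return "invalid"            # 11111xxx never appears in UTF-8


# The byte from the error message, then a continuation byte, then a plain ASCII letter.
for value in (0xD3, 0x93, 0x4E):
    print(hex(value), format(value, "08b"), utf8_byte_role(value))
```

A decoder expects every "lead-2" byte to be followed by exactly one "continuation" byte; anything else produces the error you are seeing.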

The most likely reason for this is that your file contains non-ASCII characters, but isn't in UTF-8.
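One way to check is to read the file in binary mode and find the first line that fails to decode, then try the encoding you suspect. The sketch below is only an illustration: `find_first_bad_line` is a hypothetical helper, the temporary file stands in for your real one, and Latin-1 is a guess on my part (0xD3 is an accented capital O in that encoding), so confirm the real encoding with whoever produced the file.

```python
import os
import tempfile


def find_first_bad_line(path, encoding="utf-8"):
    """Return (line_number, raw_bytes) for the first line that fails to decode, else None."""
    # Binary mode reads raw bytes one line at a time, so a large file is never
    # loaded whole. Splitting on b"\n" suits UTF-8 or Latin-1 data, not UTF-16.
    with open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                return number, raw
    return None


# Build a small stand-in file; the contents here are made up for the example.
sample = "1|EMPRESA UNO\n2|CONSTRUCCI\u00d3N SRL\n".encode("latin-1")
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample)
    sample_path = tmp.name

bad = find_first_bad_line(sample_path)
print(bad)

# Assumption: the data is Latin-1. If so, pass that encoding to open().
with open(sample_path, "r", encoding="latin-1") as handle:
    lines = [line.rstrip("\n") for line in handle]
print(lines)

os.remove(sample_path)
```

If you cannot find out the real encoding, `open()` also accepts `errors="replace"`, which substitutes a placeholder character for undecodable bytes instead of raising, at the cost of losing those characters.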

Solution 1
5 2012-02-01 04:51:03
