简体   繁体   中英

How to open a file with utf-8 non encoded characters?

I want to open a text file (.dat) in python and I get the following error: 'utf-8' codec can't decode byte 0x92 in position 4484: invalid start byte but the file is encoded using utf-8, so maybe there some character that cannot be read. I am wondering, is there a way to handle the problem without calling each single weird characters? Cause I have a rather huge text file and it would take me hours to run find the non encoded Utf-8 encoded character.

Here is my code

import codecs
f = codecs.open('compounds.dat', encoding='utf-8')
for line in f:
    if "InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
        print(line)
searchfile.close()

It shouldn't "take you hours" to find the bad byte. The error tells you exactly where it is; it's at index 4484 in your input with a value of 0x92 ; if you did:

with open('compounds.dat', 'rb') as f:
    data = f.read()

the invalid byte would be at data[4484] , and you can slice as you like to figure out what's around it.

In any event, if you just want to ignore or replace invalid bytes, that's what the errors parameter is for. Using io.open (because codecs.open is subtly broken in many ways, and io.open is both faster and more correct):

# If this is Py3, you don't even need the import, just use plain open which is
# an alias for io.open
import io

with io.open('compounds.dat', encoding='utf-8', errors='ignore') as f:
    for line in f:
        if u"InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
            print(line)

will just ignore the invalid bytes (dropping them as if they never existed). You can also pass errors='replace' to insert a replacement character for each garbage byte, so you're not silently dropping data.

if working with huge data , better to use encoding as default and if the error persists then use errors="ignore" as well

with open("filename" , 'r'  , encoding="utf-8",errors="ignore") as f:
    f.read()

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM