
Strip out binary data from a text file in Python

I have a text file that contains some binary data. When I read the file in text mode using Python 3, I get a UnicodeDecodeError (codec can't decode byte...) with the following lines of code:

fo = open('myfile.txt', 'r')
for line in fo:
    ...  # per-line processing goes here

How can I remove the binary data from my file? A header is printed just before each binary blob (shown here as Data Block). For example, my file looks like this, and I want to remove the çºí?¼Èדñdí:

myfile.txt:

ABCDEFGH
123456
Data Block 11
çºí?¼Èדñdí
XYZ123

The result I want is for myfile.txt to look like this:

ABCDEFGH
123456
Data Block 11
XYZ123

This is difficult, because "binary" blobs may contain valid characters or character sequences. And if the "text" in your file uses a multi-byte encoding, forget about it: non-ASCII bytes are then legitimate text too, so they no longer mark a blob.

If you know the "text" in your file only contains single-byte characters, one approach would be to read the file in as bytes, then use something like

decode('ascii', errors='ignore')

This effectively strips non-ascii characters out of the output, but if you were to do this on your file, you'd get:

ABCDEFGH
123456
Data Block 11
?d
XYZ123

Note the second to last line -- valid ascii characters were found in the blob and treated as "text".

You may start with a solution like that and fine-tune it (if possible) to meet your needs. Maybe the blobs occur on lines by themselves, so that if a line has any non-ASCII characters you can throw out the entire line. Maybe you can look at the blobs and grok some structure in them. Maybe you just settle for having random lines of partial characters in there and handle them somehow later. It's kind of application-specific at that point.
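
The "throw out the entire line" idea could be sketched roughly as follows. This is only a sketch: drop_non_ascii_lines is a made-up helper name, the sample bytes are invented to mimic the file in the question, and bytes.isascii() needs Python 3.7 or later.

```python
def drop_non_ascii_lines(lines):
    """Yield only the lines made up entirely of ASCII bytes, decoded to str."""
    for raw in lines:
        # bytes.isascii() is True only when every byte is below 0x80.
        if raw.isascii():
            yield raw.decode('ascii')


# Invented bytes standing in for the sample file; the blob line is made up.
sample = [
    b'ABCDEFGH\n',
    b'123456\n',
    b'Data Block 11\n',
    b'\xe7\xba\xed?\xbc\xc8\xd7\x93\xf1d\xed\n',
    b'XYZ123\n',
]

cleaned = ''.join(drop_non_ascii_lines(sample))
print(cleaned, end='')
```

With a real file you would pass the file object opened in 'rb' mode instead of the list. Keep in mind that a blob containing a newline byte gets split across several "lines", and a fragment that happens to be all ASCII would still slip through.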

Here's the code I used to produce that output from your sample input:

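def strip_nonascii(b):
    # Decode as ASCII, silently dropping any byte outside the 0-127 range.
    return b.decode('ascii', errors='ignore')

# Open in binary mode so nothing is decoded until strip_nonascii runs.
with open('garbled.txt', 'rb') as f:
    for line in f:
        print(strip_nonascii(line), end='')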
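
Since the question says a header line precedes each blob, another option is to ignore the byte values entirely and just drop whatever line follows a header. A rough sketch, assuming each blob sits on exactly one line right after a line starting with Data Block (skip_line_after_header and the sample bytes are made up for illustration):

```python
def skip_line_after_header(lines, header=b'Data Block'):
    """Yield every line except the one directly following a header line."""
    skip_next = False
    for raw in lines:
        if skip_next:
            # This is assumed to be the blob line, so drop it.
            skip_next = False
            continue
        if raw.startswith(header):
            skip_next = True
        yield raw


# Invented bytes standing in for the sample file; the blob line is made up.
sample = [
    b'ABCDEFGH\n',
    b'123456\n',
    b'Data Block 11\n',
    b'\xe7\xba\xed?\xbc\xc8\xd7\x93\xf1d\xed\n',
    b'XYZ123\n',
]

kept = b''.join(skip_line_after_header(sample))
print(kept.decode('ascii'), end='')
```

This only holds up if the blob never contains a newline byte; if it can, the "one line after the header" assumption breaks and you would need to know the blob's length or some terminator instead.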

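If there is also a footer after the binary data (just as there is a header before it), you could try using a regexp to replace everything between the header and the footer with nothing.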
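
A minimal sketch of that regexp idea, working on bytes so that nothing has to be decoded first. The End Block footer is purely hypothetical (the sample file in the question shows no footer), as are the strip_between name and the sample bytes:

```python
import re

# Group 1: the whole header line. Group 2: the (hypothetical) footer marker.
# DOTALL lets "." match newline bytes inside the blob; MULTILINE makes "^"
# match at the start of every line. The lazy ".*?" stops at the first footer.
BLOCK = re.compile(
    rb'(^Data Block[^\n]*\n).*?(^End Block)',
    re.DOTALL | re.MULTILINE,
)


def strip_between(data):
    """Remove everything between each header line and the following footer."""
    return BLOCK.sub(rb'\1\2', data)


# Invented sample: a blob spanning two lines between header and footer.
sample = b'ABCDEFGH\nData Block 11\n\xe7\xba\n\xed?d\nEnd Block\nXYZ123\n'
result = strip_between(sample)
print(result.decode('ascii'), end='')
```

If the blob itself could contain a line starting with the footer text, the match would end too early, so this is only as reliable as the footer is distinctive.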
