Read and interpret mixed binary and text file in Python
How can I read a file that consists of a line of numbers and text (a string of 10 comma-separated values) followed by 4096 bytes of binary data?
Something like this:
117,47966,55,115,223,224,94,0,28,OK:
\00\00\00\F6\FF\EF\FFF\00\FA\FF\00\CA\FF\009\00Z\00\D9\FFF\00\E3\FF?\00\F0\FF\00\B1\FF\9D\FF\00:\00b\00\E9\FF*\00:\00\00)\00\D3\FF,\00\C6\FF\D6\FF2\00\00!\00\00\00\FE\FF\BA\FF[\00\E8\FF.\00\F7\FF\F9\FF\E6\FF\00\D3\FF\F8\FF\00&\00\
In the past, I've been using ConstBitStream to read pure binary files. I was wondering how I can read the file line by line and, every time I find 'OK:', use ConstBitStream to read the following 4096 bytes?
with open(filename, encoding="latin-1") as f:
    lines = f.readlines()
    for i in range(1, len(lines)):
        elements = lines[i].strip().split(',')
        if len(elements) == 10:
            readNext4096bytes()  # placeholder: this is the step I don't know how to write
Let me know if this works:
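import pandas as pd
from bitstring import ConstBitStream

# Read the CSV with pandas, skipping malformed lines.
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0.
df = pd.read_csv(filename, on_bad_lines='skip', encoding='latin1')

# Take the last column (the 10th) and convert every value to a ConstBitStream
df.iloc[:, 9].apply(ConstBitStream)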
Say your file is like this:

1,2,3,OK:
4096 bytes
4,5,6,OK:
4096 bytes
...
Read the whole file in binary mode and split it on b',OK:\n':

with open(file_name, 'rb') as f:
    file = f.read()
data = file.split(b',OK:\n')
data is a list: [b'1,2,3', b'4096bytes\n4,5,6', b'4096bytes\n7,8,9', ..., b'4096bytes']
For each element after the first, the first 4096 bytes are the binary block and, after the separating newline, the rest is the next record:

bitarray, record = element[:4096], element[4096+1:]
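
The split-and-slice steps above can be put together in one small function. This is only a sketch: the names split_records, block_size and marker are made up for illustration, and it assumes every 4096-byte block is followed by a newline before the next record, as in the sample layout.

```python
def split_records(raw, block_size=4096, marker=b',OK:\n'):
    """Pair each text record with the binary block that follows it."""
    parts = raw.split(marker)
    records = [parts[0]]          # the first element is just the first record
    blocks = []
    for element in parts[1:]:
        blocks.append(element[:block_size])
        # Skip the newline that separates the block from the next record.
        rest = element[block_size + 1:]
        if rest:                  # the last element has no record after it
            records.append(rest)
    return list(zip(records, blocks))


# Synthetic two-record input standing in for the real file contents.
raw = b'1,2,3,OK:\n' + bytes(4096) + b'\n4,5,6,OK:\n' + b'\xff' * 4096
pairs = split_records(raw)
print([(record, len(block)) for record, block in pairs])
# → [(b'1,2,3', 4096), (b'4,5,6', 4096)]
```

Note that the records come back without their trailing ',OK:' because that is part of the delimiter.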
PS: if your file consists of ONE record and ONE bitarray, then data is simply [b'1,2,3', b'4096bytes']
PPS: if your binary data contains b',OK:\n', the method above fails. How likely is that? There are 256**5 possible 5-byte sequences, and a 4096-byte block contains 4096+1-5 = 4092 of them, hence the probability of this unfortunate coincidence is 4092/256**5 → 3.7216523196548223e-09 in a single binary record (treating the bytes as uniformly random). If you have a few records it's probably OK; if you have a few million records, well, you need a lot of memory, and the probability of an error is no longer negligible, since it grows roughly in proportion to the number of records.
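
If the delimiter collision is a concern, one alternative (not part of the answer above) is to read the file sequentially in binary mode: read a line, and when it ends in 'OK:', read exactly 4096 bytes before looking for the next line. The sketch below makes that concrete. The name read_records is hypothetical, and the code assumes a header line is recognisable by its last field being 'OK:'.

```python
import io


def read_records(stream, block_size=4096):
    """Yield (fields, block) pairs from a binary file-like object."""
    while True:
        line = stream.readline()
        if not line:              # end of file
            return
        fields = line.rstrip(b'\r\n').decode('latin-1').split(',')
        if fields[-1] != 'OK:':
            continue              # e.g. the bare newline that follows a block
        block = stream.read(block_size)
        if len(block) != block_size:
            raise ValueError('file ended inside a binary block')
        yield fields, block


# The first block deliberately contains the bytes b',OK:\n' to show that
# a fixed-size read is not confused by them.
tricky_block = b',OK:\n' + bytes(4091)
raw = b'1,2,3,OK:\n' + tricky_block + b'\n4,5,6,OK:\n' + b'\xff' * 4096
for fields, block in read_records(io.BytesIO(raw)):
    print(fields, len(block))
```

With a real file, pass open(filename, 'rb') instead of the io.BytesIO object. Each block can then be wrapped with ConstBitStream(bytes=block) if bitstring is used to decode the binary part. Because only one record is held at a time, this also avoids loading the whole file into memory.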