簡體   English   中英

從二進制文件python讀取數據

[英]Read data from binary file python

我有一個具有這種格式的二進制文件:

在此處輸入圖片說明

我用下面的代碼打開它:

import numpy as np

f = open("author_1", "r")

dt = np.dtype({'names': ['au_id','len_au_name','au_name','nu_of_publ', 'pub_id', 'len_of_pub_id','pub_title','num_auth','len_au_name_1', 'au_name1','len_au_name_2', 'au_name2','len_au_name_3', 'au_name3','year_publ','num_of_cit','citid','len_cit_tit','cit_tit', 'num_of_au_cit','len_cit_au_name_1','au_cit_name_1', len_cit_au_name_2',
'au_cit_name_2','len_cit_au_name_3','au_cit_name_3','len_cit_au_name_4',
'au_cit_name_4', 'len_cit_au_name_5','au_cit_name_5','year_cit'],
         'formats': [int,int,'S13',int,int,int,'S61',          int,int,'S8',int,'S7',int,'S12',int,int,int,int,'S50',int,int,
                'S7',int,'S7',int,'S9',int,'S8',int,'S1',int]})
a = np.fromfile(f, dtype=dt, count=-1, sep="")

我認為:

array([ (1, 13, b'Scott Shenker', 200, 1, 61, b'Integrated services in the internet architecture: an overview', 3, 8, b'R Braden', 7, b'D Clark', 12, b'S Shenker\xe2\x80\xa6', 1994, 1000, 401, 50, b'[HTML] An architecture for differentiated services', 5, 7, b'D Black', 7, b'S Blake', 9, b'M Carlson', 8, b'E Davies', 1, b'Z', 1998),
(402, 72, b'Resource rese', 1952544370, 544108393, 1953460848, b'ocol (RSVP)--Version 1 functional specification\x05\x00\x00\x00\x08\x00\x00\x00R Brad', 487013, 541851648, b'Zhang\x08', 1109414656, b'erson\x08', 542310400, b'Herzog\x07\x00\x00\x00S ', 1768776010, 511342, 103168, 22016, b'\x00A reliable multicast framework for light-weight s', 1769173861, 544435823, b'and app', 1633905004, b'tion le', 543974774, b'framing\x04', 458752, b'\x00\x00S Floy', 2660, b'', 1632247894),

知道如何打開整個文件嗎?

存儲在此文件中的數據結構是分層的,而不是“扁平的”:不同長度的子數組存儲在每個父元素中。 無法使用numpy數組(甚至recarrays)表示這種數據結構,因此無法使用np.fromfile()讀取文件。

“打開整個文件”是什么意思? 您最終想要什么樣的python數據結構?

編寫一個將文件解析為字典列表的函數將很簡單,但仍然不簡單。

我同意賴安(Ryan)的觀點:解析數據很簡單,但並不瑣碎,而且非常乏味。 無論通過這種方式打包數據節省了多少磁盤空間,您在拆包時就付出了高昂的代價。

無論如何,文件是由可變長度的記錄和字段組成的。 每個記錄都是由可變數量和可變長度的字段組成的,我們可以按字節塊讀取。 每個塊將具有不同的格式。 你明白了。 按照此邏輯,我組裝了這三個功能,您可以完成,修改,測試等:

from struct import Struct
import struct

def read_chunk(fmt, fileobj):
    chunk_struct = Struct(fmt)
    chunk = fileobj.read(chunk_struct.size)
    return chunk_struct.unpack(chunk)

def read_record(fileobj):
    author_id, len_author_name = read_chunk('ii', f)
    author_name, nu_of_publ = read_chunk(str(len_author_name)+'ci', f) # 's' or 'c' ?
    record = {  'author_id': author_id,
                'author_name': author_name,
                'publications': [] }
    for pub in range(nu_of_publ):
        pub_id, len_pub_title = read_chunk('ii', f)
        pub_title, num_pub_auth = read_chunk(str(len_pub_title)+'ci', f)
        record['publications'].append({
                'publication_id': pub_id,
                'publication_title': pub_title,
                'publication_authors': [] })
        for auth in range(num_pub_auth):
            len_pub_auth_name = read_chunk('i', f)
            pub_auth_name = read_chunk(str(len_pub_auth_name)+'c', f)
            record['publications']['publication_authors'].append({'name': pub_auth_name})
        year_publ, nu_of_cit = read_chunk('ii', f)
        # Finish building your record with the remaining fields...
        for cit in range(nu_of_cit):
            cit_id, len_cit_title = read_chunk('ii', f)
            cit_title, num_cit_auth = read_chunk(str(len_cit_title)+'ci', f)
        for cit_auth in range(num_cit_auth):
            len_cit_auth_name = read_chunk('i', f)
            cit_auth_name = read_chunk(str(len_cit_auth_name)+'c', f)
        year_cit_publ = read_chunk('i', f)
    return record

def parse_file(filename):
    records = []
    with open(filename, 'rb') as f:
        while True:
            try:
                records.append(read_record(f))
            except struct.error:
                break
    # do something useful with the records...

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM