
What to put in a binary data file's header

I have a simulation that reads large binary data files that we create (10s to 100s of GB). We use binary for speed reasons. These files are system dependent, converted from text files on each system that we run, so I'm not concerned about portability. The files currently are many instances of a POD struct, written with fwrite.

I need to change the struct, so I want to add a header that has a file version number in it, which will be incremented anytime the struct changes. Since I'm doing this, I want to add some other information as well. I'm thinking of the size of the struct, byte order, and maybe the svn version number of the code that created the binary file. Is there anything else that would be useful to add?

In my experience, second-guessing the data you'll need is invariably wasted time. What's important is to structure your metadata in a way that is extensible. For XML files, that's straightforward, but binary files require a bit more thought.

I tend to store metadata in a structure at the END of the file, not the beginning. This has two advantages:

  • Truncated/unterminated files are easily detected.
  • Metadata footers can often be appended to existing files without impacting their reading code.

The simplest metadata footer I use looks something like this:

#include <stdint.h>

struct MetadataFooter {
  char creatorVersion[40];
  char creatorApplication[40];
  /* ... or whatever else you need */
};

struct FileFooter {
  int64_t metadataFooterSize;  /* = sizeof(struct MetadataFooter) */
  char magicString[10];        /* a unique identifier for the format: maybe "MYFILEFMT" */
};

After the raw data, the metadata footer and THEN the file footer are written.

When reading the file, seek to the end minus sizeof(FileFooter). Read the footer, and verify the magicString. Then seek back according to metadataFooterSize and read the metadata. Depending on the footer size contained in the file, you can use default values for missing fields.
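The write and read steps above can be sketched roughly as follows. This is only a sketch under assumptions: the function names write_footers and read_metadata are made up for illustration, "MYFILEFMT" is just the example magic string, and plain fseek takes a long offset, so for files of tens of GB you would likely need a 64-bit seek such as POSIX fseeko or the Windows _fseeki64.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAGIC "MYFILEFMT"   /* 9 characters + NUL = 10 bytes */

struct MetadataFooter {
    char creatorVersion[40];
    char creatorApplication[40];
};

struct FileFooter {
    int64_t metadataFooterSize;
    char magicString[10];
};

/* Hypothetical helper: append the metadata footer, THEN the file footer,
 * at the current write position. Returns 0 on success, -1 on error. */
int write_footers(FILE *f, const struct MetadataFooter *meta)
{
    struct FileFooter ff;
    memset(&ff, 0, sizeof ff);                    /* zero padding bytes too */
    ff.metadataFooterSize = (int64_t)sizeof *meta;
    memcpy(ff.magicString, MAGIC, sizeof MAGIC);
    if (fwrite(meta, sizeof *meta, 1, f) != 1) return -1;
    if (fwrite(&ff, sizeof ff, 1, f) != 1) return -1;
    return 0;
}

/* Hypothetical helper: read the metadata back from the end of the file.
 * Fields that an older, shorter footer lacks are left zeroed as defaults.
 * Returns 0 on success, -1 if the file is truncated or not ours. */
int read_metadata(FILE *f, struct MetadataFooter *meta)
{
    struct FileFooter ff;
    size_t n = sizeof *meta;

    if (fseek(f, -(long)sizeof ff, SEEK_END) != 0) return -1;
    if (fread(&ff, sizeof ff, 1, f) != 1) return -1;
    if (memcmp(ff.magicString, MAGIC, sizeof MAGIC) != 0) return -1;
    if (ff.metadataFooterSize < 0) return -1;

    /* Read only the part that both the file and this code know about. */
    if ((uint64_t)ff.metadataFooterSize < n) n = (size_t)ff.metadataFooterSize;

    memset(meta, 0, sizeof *meta);
    if (fseek(f, -(long)(sizeof ff + (size_t)ff.metadataFooterSize), SEEK_END) != 0)
        return -1;
    if (n > 0 && fread(meta, n, 1, f) != 1) return -1;
    return 0;
}
```

A file that was cut off mid-write has no valid FileFooter at its end, so read_metadata fails on the magic string check, which is how truncation gets detected.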

As KeithB points out, you could even use this technique to store the metadata as an XML string, giving you both totally extensible metadata and the compactness and speed of binary data.

For large binaries I'd look seriously at HDF5 (Google for it). Even if it's not something you want to adopt, it might point you in some useful directions in designing your own formats.

For large binaries, in addition to the version number I tend to include a record count and a CRC, because large binaries are much more prone to being truncated or corrupted over time or during transfer than smaller ones. I recently found, to my horror, that Windows does not handle this well at all: I used Explorer to copy about 2 TB across a couple of hundred files to an attached NAS device, and found that 2-3 files in each copy were damaged (not completely copied).
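To make the record count and CRC idea concrete, here is a minimal sketch. It assumes CRC-32 with the reflected polynomial 0xEDB88320 (the variant used by zlib and PNG) computed bit by bit; the names crc32_update and record_count_matches are hypothetical, and a real reader of 100 GB files would probably want a table-driven or hardware CRC instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Feed a chunk of data into a running CRC-32. Start with crc = 0 and
 * pass the previous result back in for each subsequent chunk. */
uint32_t crc32_update(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;
    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Cheap truncation check: does the file size agree with the record
 * count stored in the header? Returns 1 if it does, 0 otherwise. */
int record_count_matches(uint64_t file_size, uint64_t header_size,
                         uint64_t record_size, uint64_t record_count)
{
    if (file_size < header_size || record_size == 0) return 0;
    return (file_size - header_size) / record_size == record_count
        && (file_size - header_size) % record_size == 0;
}
```

The size check costs nothing and catches incomplete copies; the CRC requires reading the whole file, so it may be something to run after a transfer rather than on every load.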

An identifier for the type of the file would be useful if you will have other structures written to binary files later on. Maybe this could be a short string, so you can see what the file contains by looking into it with a hex editor.

If they're that large, I'd reserve a healthy chunk (64K?) of space at the beginning of the file and put the metadata there in XML format followed by an end-of-file character (Ctrl-Z for DOS/Windows, Ctrl-D for Unix?). That way you can examine and parse the metadata easily with the wide range of toolsets out there for XML.

Otherwise I go with what other people have already said: a timestamp for file creation, an identifier for the machine it was created on, and basically anything else you can think of for diagnostic purposes. Ideally you would also include the definition of the structure format itself. If you are changing the structure often, it's a big pain to keep the proper version of the code around to read the various formats of old data files.

One big advantage of HDF5, as @highpercomp has mentioned, is that you just don't need to worry about changes in the structure format, as long as you have some convention for what the names and datatypes are. The structure names and datatypes are all stored in the file itself, so you can blow your C code to smithereens and it doesn't matter: you can still retrieve data from an HDF5 file. It lets you worry less about the format of the data and more about the structure of the data, i.e. I don't care about the sequence of bytes, that's HDF5's problem, but I do care about field names and the like.

Another reason I like HDF5 is that you can choose to use compression, which takes a very small amount of time and can give you huge wins in storage space if the data is slowly changing or mostly the same except for a few errant blips of interestingness.

@rstevens said 'an identifier for the type of file'... sound advice. Conventionally, that's called a magic number and, in a file, it isn't a term of abuse (unlike in code, where it is). Basically, it is some number (typically at least 4 bytes, and I usually ensure that at least one of those bytes is not ASCII) that you can use to validate that the file is of the type you expect, with a low probability of confusion. You can also write a rule in /etc/magic (or the local equivalent) to report that files containing your magic number are your special file type.
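A magic number check might look something like the sketch below. The four bytes here are invented for illustration (a non-ASCII 0x89 followed by "SIM", loosely in the style of the PNG signature), and write_magic and has_magic are hypothetical names.

```c
#include <stdio.h>
#include <string.h>

/* Made-up magic number: one non-ASCII byte, then a readable tag that
 * shows up in a hex editor. */
static const unsigned char FILE_MAGIC[4] = { 0x89, 'S', 'I', 'M' };

/* Write the magic number at the current position (normally offset 0). */
int write_magic(FILE *f)
{
    return fwrite(FILE_MAGIC, sizeof FILE_MAGIC, 1, f) == 1 ? 0 : -1;
}

/* Returns 1 if the file starts with our magic number, 0 otherwise. */
int has_magic(FILE *f)
{
    unsigned char buf[sizeof FILE_MAGIC];
    if (fseek(f, 0L, SEEK_SET) != 0) return 0;
    if (fread(buf, sizeof buf, 1, f) != 1) return 0;
    return memcmp(buf, FILE_MAGIC, sizeof buf) == 0;
}
```

The non-ASCII byte makes it unlikely that a plain text file is ever mistaken for one of your data files.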

You should include a file format version number. However, I would recommend not using the SVN number of the code. Your code may change when the file format does not.

As my experience with telecom equipment configuration and firmware upgrades shows, you only really need a few predefined bytes at the beginning (this is important), starting with the version; this is the fixed part of the header. The rest of the header is optional: by indicating the proper version you can always show how to process it. The important thing here is that you'd better place the 'variable' part of the header at the end of the file if you plan operations on the header that don't modify the file content itself. This also simplifies 'append' operations, which have to recalculate the variable part of the header.

Nice-to-have features for the fixed-size header (at the beginning):

  • A common 'length' field (including the header).
  • Something like a CRC32 (including the header).

OK, for the variable part, XML or some other nicely extensible format is a good idea, but is it really needed? I have had a lot of experience with ASN encoding... in most cases its usage was overkill.

Well, maybe you will get a better idea if you look at things like the TPKT format, which is described in RFC 2126 (chapter 4.3).

In addition to whatever information you need for schema versioning, add details that may be of value if you are troubleshooting an issue. For example:

  • timestamps of when the file was created and updated (if applicable).
  • the version string from the build (ideally you have a version string that is auto-incremented on every 'official' build ... this is different from the file schema version).
  • the name of the system creating the file, and maybe other statistics that are relevant to your app.

We find this very useful (a) for getting information we would otherwise have to ask the customer to provide and (b) for getting correct information: it is amazing how many customers report that they are running a different version of the software from the one the data claims!

You might consider putting a file offset in a fixed position in the header, which tells you where the actual data begins in the file. This would let you change the size of the header when needed.
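As a rough sketch of the data-offset idea, assuming a hypothetical FixedHeader layout and a made-up seek_to_data helper (the cast to long is a limitation of plain fseek, so very large offsets would need a 64-bit seek function):

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fixed part of the header. Anything between the end of
 * this struct and dataOffset is free for future header fields. */
struct FixedHeader {
    uint32_t formatVersion;
    uint32_t recordSize;
    uint64_t dataOffset;   /* byte offset where the records begin */
};

/* Read the fixed header and position the stream at the first record.
 * Returns 0 on success, -1 on error. */
int seek_to_data(FILE *f, struct FixedHeader *h)
{
    if (fseek(f, 0L, SEEK_SET) != 0) return -1;
    if (fread(h, sizeof *h, 1, f) != 1) return -1;
    if (h->dataOffset < sizeof *h) return -1;   /* offset inside header */
    return fseek(f, (long)h->dataOffset, SEEK_SET) == 0 ? 0 : -1;
}
```

Old readers that honor dataOffset keep working when a newer writer grows the header, because they never assume where the records start.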

In a couple of cases, I put the value 0x12345678 into the header so I could detect whether the file format matched the endianness of the machine that was processing it.
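A sketch of how that check might work on the reading side, assuming the writer stored 0x12345678 as a native 32-bit integer; the enum and the function name check_byte_order are made up for illustration:

```c
#include <stdint.h>
#include <string.h>

#define ENDIAN_MARKER 0x12345678u

enum ByteOrderCheck { ORDER_NATIVE, ORDER_SWAPPED, ORDER_UNKNOWN };

/* raw points at the 4 marker bytes exactly as read from the file. */
enum ByteOrderCheck check_byte_order(const unsigned char raw[4])
{
    uint32_t v;
    memcpy(&v, raw, sizeof v);          /* reinterpret in native order */
    if (v == ENDIAN_MARKER) return ORDER_NATIVE;
    if (v == 0x78563412u)   return ORDER_SWAPPED;  /* written on the other endianness */
    return ORDER_UNKNOWN;               /* corrupt, or not our file */
}
```

Since the files here are converted per system, ORDER_SWAPPED would most likely just mean "this file was copied from the wrong machine", which is still worth reporting clearly.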

My variation combines Roddy's and Jason S's approaches.

In summary: put formatted text metadata at the end of the file, with its length stored elsewhere so you can find it.

1) Put a length field at the beginning of your file so you know the length of the metadata at the end rather than assuming a fixed length. That way, to get the metadata you just read that fixed-length initial field and then get the metadata blob from the end of the file.

2) Use XML or YAML or JSON for the metadata. This is especially useful/safe if the metadata is appended at the end, because nobody reading the file is going to automatically assume it's all XML just because it starts with XML.

The only disadvantage of this approach is that when your metadata grows, you have to update both the head and the tail of the file, but it's likely other parts will have been updated anyway. If it's just updating trivia like a last-accessed date, then the metadata length won't change, so it only needs an update in place.
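The reading side of this layout could be sketched as below. The assumptions are mine: the length is stored as a native 64-bit integer in the first 8 bytes, the function name read_trailing_metadata is hypothetical, the 1 MiB cap is an arbitrary sanity limit, and parsing the returned XML/YAML/JSON text is left to whatever library you already use.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns a malloc'd, NUL-terminated copy of the metadata text stored
 * at the end of the file, or NULL on error. The caller frees it. */
char *read_trailing_metadata(FILE *f)
{
    uint64_t len;
    char *text;

    /* 1) The fixed-length field at the head gives the blob's length. */
    if (fseek(f, 0L, SEEK_SET) != 0) return NULL;
    if (fread(&len, sizeof len, 1, f) != 1) return NULL;
    if (len > 1024u * 1024u) return NULL;   /* arbitrary sanity cap */

    /* 2) The blob itself is the last len bytes of the file. */
    if (fseek(f, -(long)len, SEEK_END) != 0) return NULL;
    text = malloc((size_t)len + 1);
    if (text == NULL) return NULL;
    if (len > 0 && fread(text, (size_t)len, 1, f) != 1) {
        free(text);
        return NULL;
    }
    text[len] = '\0';
    return text;
}
```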

If you are putting a version number in the header, you can change that version anytime you need to change the POD struct or add new fields to the header.

So don't add stuff to the header now just because it might be interesting. You are only creating code that you have to maintain but that has little real value.

For large files, you might want to add data definitions too, so that your file format becomes self-describing.
