Processing large binary data files

Question

I am working with large binary files (approx. 2 GB each) that contain raw data. These files have a well-defined structure: each file is an array of events, and each event is an array of data banks. Each event and data bank has its own structure (header, data type, etc.).

From these files, all I have to do is extract whatever data I need, and then analyze and play with it. I might not need all of the data; sometimes I extract only XType data, other times only YType, etc.

I don't want to shoot myself in the foot, so I am asking for guidance/best practices on how to deal with this. I can think of two possibilities:

Option 1

Define a DataBank class, which contains the actual data (std::vector<T>) and whatever structure it has.
Define an Event class, which has a std::vector<DataBank> plus whatever structure.
Define a MyFile class, which is a std::vector<Event> plus whatever structure.

The constructor of MyFile will take a std::string (the name of the file) and do all the heavy lifting of reading the binary file into the classes above.

Then, whatever I need from the binary file will just be a method of the MyFile class; I can loop through Events, I can loop through DataBanks, and everything I could need is already in this "unpacked" object.

The workflow here would be like:

int main() {
    MyFile data_file("data.bin");
    std::vector<XData> my_data = data_file.getXData();
    // Play with my_data, and never again use the data_file object
    // ...
    return 0;
}
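A minimal sketch of what the Option 1 classes could look like. The question does not give the real file format, so the on-disk layout used here (a bank count per event, then a type and payload size per bank) is purely hypothetical, as are the helper readU32 and the events() accessor. It is meant only to show where the "heavy lifting" in the constructor would go, and that memory use ends up being at least about the size of the file.

```cpp
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// HYPOTHETICAL on-disk layout (the real format is not given in the question):
//   event : uint32 bank_count, followed by bank_count banks
//   bank  : uint32 type, uint32 payload_bytes, followed by the payload
// All integers are assumed to be stored in the machine's native byte order.

struct DataBank {
    std::uint32_t type = 0;
    std::vector<std::uint8_t> payload;  // raw bytes; decoding is type-specific
};

struct Event {
    std::vector<DataBank> banks;
};

// Reads one uint32; returns false if the stream has ended or failed.
inline bool readU32(std::istream& in, std::uint32_t& value) {
    return static_cast<bool>(in.read(reinterpret_cast<char*>(&value), sizeof value));
}

class MyFile {
public:
    // Unpacks the whole file up front, so every bank ends up in memory.
    explicit MyFile(const std::string& path) {
        std::ifstream in(path, std::ios::binary);
        if (!in) throw std::runtime_error("cannot open " + path);

        std::uint32_t bankCount = 0;
        while (readU32(in, bankCount)) {  // one iteration per event
            Event event;
            event.banks.resize(bankCount);
            for (DataBank& bank : event.banks) {
                std::uint32_t size = 0;
                if (!readU32(in, bank.type) || !readU32(in, size))
                    throw std::runtime_error("truncated bank header");
                bank.payload.resize(size);
                if (size > 0 &&
                    !in.read(reinterpret_cast<char*>(bank.payload.data()), size))
                    throw std::runtime_error("truncated bank payload");
            }
            events_.push_back(std::move(event));
        }
    }

    const std::vector<Event>& events() const { return events_; }

private:
    std::vector<Event> events_;
};
```

With this shape, a method such as getXData() would just loop over events() and decode the banks of the wanted type. The memory is released when the MyFile object goes out of scope, so handling files one at a time would not require holding all of them at once.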

Option 2

Write functions that take a std::string as an argument and extract whatever I need from the file, e.g. std::vector<XData> getXData(std::string), int getNumEvents(std::string), etc.

The workflow here would be like:

int main() {
    std::vector<XData> my_data = getXData("data.bin");
    // Play with my_data, and I didn't create a massive object
    // ...
    return 0;
}
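A rough sketch of how an Option 2 style function might walk the file and keep only one bank type. The layout is hypothetical (a bank count per event, then a type and payload size per bank), since the question does not specify the real one, and extractBanks and readU32 are illustrative names rather than anything from the original. The point it illustrates is that unwanted payloads can be skipped with seekg instead of being read, so memory grows only with the data that is actually kept.

```cpp
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// HYPOTHETICAL on-disk layout (the real format is not given in the question):
//   event : uint32 bank_count, followed by bank_count banks
//   bank  : uint32 type, uint32 payload_bytes, followed by the payload
// All integers are assumed to be stored in the machine's native byte order.

// Reads one uint32; returns false if the stream has ended or failed.
inline bool readU32(std::istream& in, std::uint32_t& value) {
    return static_cast<bool>(in.read(reinterpret_cast<char*>(&value), sizeof value));
}

// Returns the raw payload of every bank whose type equals wantedType.
std::vector<std::vector<std::uint8_t>> extractBanks(const std::string& path,
                                                     std::uint32_t wantedType) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);

    std::vector<std::vector<std::uint8_t>> result;
    std::uint32_t bankCount = 0;
    while (readU32(in, bankCount)) {  // one iteration per event
        for (std::uint32_t i = 0; i < bankCount; ++i) {
            std::uint32_t type = 0, size = 0;
            if (!readU32(in, type) || !readU32(in, size))
                throw std::runtime_error("truncated bank header");
            if (type != wantedType) {
                in.seekg(size, std::ios::cur);  // skip the payload without reading it
                continue;
            }
            std::vector<std::uint8_t> payload(size);
            if (size > 0 &&
                !in.read(reinterpret_cast<char*>(payload.data()), size))
                throw std::runtime_error("truncated bank payload");
            result.push_back(std::move(payload));
        }
    }
    return result;
}
```

Keeping the header parsing in one function like this, and building getXData and similar functions on top of it, would confine knowledge of the file structure to a single place.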

Pros and cons that I see

Option 1 seems like the cleaner option: I would only "unpack" the binary file once, in the MyFile constructor. But I will have created a huge object that contains all the data from a 2 GB file, which I will never use. If I need to analyze 20 files (2 GB each), will I need 40 GB of RAM? I don't understand how this is handled; will it affect performance?

Option 2 seems to be faster: I will just extract whatever data I need, and that's it; I won't "unpack" the entire binary file just to later extract the data I care about. I will only create objects for the data I will play with. The problem is that I will have to deal with the binary file structure in every function; if it ever changes, that will be a pain.

As you can see from my question, I don't have much experience dealing with large structures and files. I appreciate any advice.

Answer 1

I do not know whether the following scenario matches yours.

I had a case of processing huge log files of hardware signal logging in the automotive field. Signals such as door locked, radio on, temperature, and thousands more appeared, some of them periodically. The operator selects some signal types and then analyzes diagrams of the signal values.

This scenario is based on a huge log file that grows over time.

What I did was create, for every signal type, its own log file extract in an optimized binary format (one would load a fixed-size byte[] array).
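The per-type extract idea could be sketched roughly as follows in C++. The answer does not describe its actual record format, so LogEntry, Sample, the file naming scheme, and both function names are assumptions made for illustration only. Each signal type gets its own file of fixed-size records, appended to as the log grows, and loading one type is a single bulk read. Dumping structs as raw bytes like this is only readable on a machine with the same byte order and struct layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <map>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <vector>

// HYPOTHETICAL record types; the answer does not give its real format.
struct LogEntry {            // one decoded line of the big log
    std::uint32_t signalId;
    double timestamp;
    double value;
};

struct Sample {              // fixed-size record stored in each extract file
    double timestamp;
    double value;
};
static_assert(std::is_trivially_copyable<Sample>::value,
              "Sample is written to disk as raw bytes");

// Assumed naming scheme: <prefix>_<signalId>.bin
std::string extractPath(const std::string& prefix, std::uint32_t signalId) {
    return prefix + "_" + std::to_string(signalId) + ".bin";
}

// Groups entries by signal type and appends each group to its own file.
void writeExtracts(const std::vector<LogEntry>& entries, const std::string& prefix) {
    std::map<std::uint32_t, std::vector<Sample>> bySignal;
    for (const LogEntry& e : entries)
        bySignal[e.signalId].push_back({e.timestamp, e.value});

    for (const auto& kv : bySignal) {
        const std::string path = extractPath(prefix, kv.first);
        std::ofstream out(path, std::ios::binary | std::ios::app);
        if (!out) throw std::runtime_error("cannot open " + path);
        out.write(reinterpret_cast<const char*>(kv.second.data()),
                  static_cast<std::streamsize>(kv.second.size() * sizeof(Sample)));
    }
}

// Loads every sample of one signal type with a single read.
std::vector<Sample> loadExtract(const std::string& prefix, std::uint32_t signalId) {
    std::ifstream in(extractPath(prefix, signalId), std::ios::binary | std::ios::ate);
    if (!in) return {};                       // no extract for this signal yet
    const std::streamoff bytes = in.tellg();  // opened at the end, so this is the size
    if (bytes <= 0) return {};
    std::vector<Sample> samples(static_cast<std::size_t>(bytes) / sizeof(Sample));
    in.seekg(0);
    in.read(reinterpret_cast<char*>(samples.data()),
            static_cast<std::streamsize>(samples.size() * sizeof(Sample)));
    return samples;
}
```

Applied to the question, the same pattern would mean one pass over each 2 GB file that writes out, say, an XType extract and a YType extract, after which the analysis only ever opens the small per-type files.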

This meant that displaying the diagram for just 10 types was feasible fast, in real time, including zooming in on a time interval, dynamically selecting signal types, and so on.

I hope this gives you some ideas.
