
How to handle a large amount of float data?

Original post: 2011-05-11 14:33:38 9 7 java

We have a binary file that contains a large amount of float data (about 80 MB), and we need to process it in our Java application. The data comes from a medical scanner. One file contains the data from one Rotation. One Rotation contains 960 Views. One View contains 16 Rows, and one Row contains 1344 Cells. These numbers (and their relationship) are fixed.

We need to read ALL the floats into our application, with a code structure that reflects the Rotation-View-Row-Cell structure above.

What we are doing now is using a float[] to hold the data for the Cells, and ArrayLists for Rotation, View and Row to hold their data.

I have two questions:

How can we populate the Cell data (read the floats into our float[]) quickly?
Do you have a better idea for how to hold this data?

7 answers

Assuming you don't make changes to the data (add more views, etc.), why not put everything in one big array? The point of ArrayLists is that you can grow and shrink them, which you don't need here. You can write access methods to get the right cell for a given view, row, etc.
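A minimal sketch of the single-array idea, in case it helps. The class name FlatRotation and the view-major ordering (all cells of a row, then all rows of a view) are assumptions; the question does not say how the file is laid out, so adjust index() to match the real order.

```java
/** Hypothetical wrapper: one Rotation stored in a single flat float[]. */
final class FlatRotation {
    static final int VIEWS = 960;   // Views per Rotation (from the question)
    static final int ROWS = 16;     // Rows per View
    static final int CELLS = 1344;  // Cells per Row

    // 960 * 16 * 1344 = 20,643,840 floats, roughly 80 MB of heap.
    private final float[] data = new float[VIEWS * ROWS * CELLS];

    /** Maps (view, row, cell) to a position in the flat array (assumed view-major order). */
    static int index(int view, int row, int cell) {
        return (view * ROWS + row) * CELLS + cell;
    }

    float get(int view, int row, int cell) {
        return data[index(view, row, cell)];
    }

    void set(int view, int row, int cell, float value) {
        data[index(view, row, cell)] = value;
    }
}
```

With this layout, the cells of one Row are contiguous, so a loop over a whole Row or View walks the array sequentially.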

Using arrays of arrays is a better idea: that way the system figures out how to access each element for you, and it is just as fast as a single array.

Michael is right: you need to buffer the input, otherwise you will be doing a file access operation for every byte and your performance will be awful.

If you want to stick with the current approach as much as possible, you can minimize the memory used by your ArrayLists by setting their capacity to the number of elements they hold. Otherwise they keep a number of slots in reserve, expecting you to add more.
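For the ArrayList route, the capacity can be passed to the constructor up front. This is only a sketch: the class and method names (PresizedLists, allocateRotation) are made up for illustration, and the nesting of lists is my reading of the structure described in the question.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that builds the nested lists with exact capacities. */
final class PresizedLists {
    /** Returns a Rotation as a list of Views, each a list of Rows, each Row a float[]. */
    static List<List<float[]>> allocateRotation(int views, int rows, int cells) {
        // Passing the final size avoids both re-growing and spare slots.
        ArrayList<List<float[]>> rotation = new ArrayList<>(views);
        for (int v = 0; v < views; v++) {
            ArrayList<float[]> view = new ArrayList<>(rows);
            for (int r = 0; r < rows; r++) {
                view.add(new float[cells]);
            }
            rotation.add(view);
        }
        return rotation;
    }
}
```

For lists that were already filled without a capacity hint, calling ArrayList.trimToSize() afterwards releases the reserved slots.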

Use a DataInputStream (and its readFloat() method) wrapping a FileInputStream, possibly with a BufferedInputStream in between (test whether the buffer helps performance or not).
Your data structure looks fine.
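Roughly, that stream chain could look like the sketch below. The names FloatReader, readFloats and readFile are hypothetical, and the 64 KB buffer size is an arbitrary choice. One assumption to check: readFloat() expects big-endian IEEE 754 values, and the question does not say which byte order the scanner writes.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: reads floats through a buffered DataInputStream. */
final class FloatReader {
    /** Reads exactly count big-endian floats from the given stream. */
    static float[] readFloats(InputStream in, int count) throws IOException {
        DataInputStream data = new DataInputStream(new BufferedInputStream(in, 1 << 16));
        float[] result = new float[count];
        for (int i = 0; i < count; i++) {
            result[i] = data.readFloat(); // throws EOFException if the stream is too short
        }
        return result;
    }

    /** Opens the file at path (caller-supplied) and reads count floats from it. */
    static float[] readFile(String path, int count) throws IOException {
        try (InputStream in = new FileInputStream(path)) {
            return readFloats(in, count);
        }
    }
}
```

For the file in the question, count would be 960 * 16 * 1344.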

Are you having any particular performance or usage problems with your current approach?

The only thing I can suggest, based on the information you provide, is to try representing a View as a float[][] of rows and cells.

For the data loading:

DataInputStream should work well. But make sure you wrap the underlying FileInputStream in a BufferedInputStream, otherwise you run the risk of doing an I/O operation for every float, which can kill performance.

Several options for holding the data:

The (very marginally) most memory-efficient way will be to store the entire array in one large float[], and calculate offsets into it as needed. A bit ugly to use, but it might make sense if you are doing a lot of calculations or processing loops over the entire set.
The most "OOP" way would be to have separate objects for Rotation, View, Row and Cell. But having each cell as a separate object is pretty wasteful, and might even blow your memory limits.
You could use nested ArrayLists with a float[1344] to represent the lowest-level data for the cells in each row. I understand this is what you're currently doing, and in fact I think it's a pretty good choice. The overhead of the ArrayLists won't be much compared to the overall data size.
A final option would be to use a float[rotationNum][rowNum][cellNum] to represent each rotation. A bit more efficient than ArrayLists, but arrays are usually less nice to manipulate. However, this seems a pretty good option if, as you say, the array sizes will always be fixed. I'd probably choose this option myself.

I also think that you can put your whole data structure into a float[][][] (the same as Nathan Hughes suggests). You could have a method that reads your file and returns a float[][][], where the first dimension is the views (960), the second is the rows (16), and the third is the cells (1344). If those numbers are fixed, you'd better use this approach: you save memory, and it's faster.
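Such a method might look something like this sketch. RotationLoader and read are hypothetical names, and it assumes the file stores big-endian floats view by view, row by row, which the question does not confirm.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical loader that fills a float[views][rows][cells] from a stream. */
final class RotationLoader {
    static float[][][] read(InputStream in, int views, int rows, int cells) throws IOException {
        DataInputStream data = new DataInputStream(new BufferedInputStream(in));
        float[][][] rotation = new float[views][rows][cells];
        // Fill in file order: every cell of a row, every row of a view (assumed layout).
        for (float[][] view : rotation) {
            for (float[] row : view) {
                for (int c = 0; c < row.length; c++) {
                    row[c] = data.readFloat();
                }
            }
        }
        return rotation;
    }
}
```

A call for the data in the question would pass 960, 16 and 1344; afterwards rotation[v][r][c] addresses a single cell directly.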

80 MB shouldn't be so much data that you need to worry terribly much. I would really suggest:

create Java wrapper objects representing the most logical structure/hierarchy for the data you have;
one way or another, ensure that you're only making an actual "raw" I/O call (so an InputStream.read() or equivalent) every 16K or so of data; e.g. you could read into a 16K/32K byte array that is wrapped in a ByteBuffer for the purpose of pulling out the floats, or whatever you need for your data;
if you actually have a performance problem with this approach, try to identify, not second-guess, what that performance problem actually is.
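The chunked-read suggestion above could be sketched as follows. ChunkedFloatReader and readFully are hypothetical names, the 32 KB chunk size is just one of the sizes mentioned, and the byte order is left as a parameter because the question does not say how the scanner writes its floats.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical reader: pulls floats out of 32 KB chunks via a ByteBuffer. */
final class ChunkedFloatReader {
    /** Fills dest completely from the stream, decoding 4-byte floats in the given order. */
    static void readFully(InputStream in, float[] dest, ByteOrder order) throws IOException {
        byte[] chunk = new byte[32 * 1024]; // a multiple of 4, so no float straddles two chunks
        int filled = 0;
        while (filled < dest.length) {
            int wantBytes = (int) Math.min((long) chunk.length, (long) (dest.length - filled) * 4);
            int got = 0;
            // read() may return fewer bytes than requested, so loop until the chunk is complete.
            while (got < wantBytes) {
                int n = in.read(chunk, got, wantBytes - got);
                if (n < 0) {
                    throw new EOFException("stream ended before all floats were read");
                }
                got += n;
            }
            int floats = wantBytes / 4;
            ByteBuffer.wrap(chunk, 0, wantBytes).order(order).asFloatBuffer().get(dest, filled, floats);
            filled += floats;
        }
    }
}
```

The destination array can be a flat float[] for the whole Rotation or a single Row's float[1344], depending on which structure you settle on.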