
How to handle large lists of data

We have a part of an application where, say, 20% of the time it needs to read in a huge amount of data that exceeds memory limits. While we can increase the memory limits, we hesitate to do so, since it would require a high allocation most of the time, when it's not necessary.

We are considering using a customized java.util.List implementation that spools to disk when we hit peak loads like this, but remains in memory under lighter circumstances.
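A minimal sketch of that idea, for illustration only (the class name and threshold are hypothetical, and records are simplified to Strings; a real version would serialize domain objects and implement the full java.util.List contract):

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.function.Consumer;

/** Sketch of a spill-to-disk collection: keeps records in memory up to a
 *  threshold, then appends the overflow to a temp file. Records are
 *  simplified to Strings here (assumes no embedded newlines); a real
 *  implementation would serialize domain objects. */
class SpoolingCollection implements Closeable {
    private final int memoryThreshold;
    private final List<String> inMemory = new ArrayList<>();
    private Path spillFile;          // created lazily on first overflow
    private BufferedWriter spillOut;

    SpoolingCollection(int memoryThreshold) {
        this.memoryThreshold = memoryThreshold;
    }

    void add(String record) throws IOException {
        if (inMemory.size() < memoryThreshold) {
            inMemory.add(record);
        } else {
            if (spillOut == null) {
                spillFile = Files.createTempFile("spool", ".dat");
                spillOut = Files.newBufferedWriter(spillFile);
            }
            spillOut.write(record);
            spillOut.newLine();
        }
    }

    /** Streams every record (memory first, then disk) to the consumer,
     *  matching the load-once, iterate-once, discard usage pattern. */
    void forEachRecord(Consumer<String> consumer) throws IOException {
        inMemory.forEach(consumer);
        if (spillOut != null) {
            spillOut.flush();
            try (BufferedReader in = Files.newBufferedReader(spillFile)) {
                for (String line; (line = in.readLine()) != null; ) {
                    consumer.accept(line);
                }
            }
        }
    }

    @Override public void close() throws IOException {
        if (spillOut != null) {
            spillOut.close();
            Files.deleteIfExists(spillFile);
        }
    }
}
```

Below the threshold, everything stays in memory and no file is ever created, which matches the "high allocation only when needed" goal.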

The data is loaded once into the collection, subsequently iterated over and processed, and then thrown away. It doesn't need to be sorted once it's in the collection.

Does anyone have pros/cons regarding such an approach?

Is there an open source product that provides some sort of List impl like this?

Thanks!

Updates:

  • Not to be cheeky, but by 'huge' I mean exceeding the amount of memory we're willing to allocate without interfering with other processes on the same hardware. What other details do you need?
  • The application is essentially a batch processor that loads data from multiple database tables and conducts extensive business logic on it. All of the data in the list is required, since aggregate operations are part of the logic.
  • I just came across this post, which offers a very good option: STXXL equivalent in Java

Do you really need to use a List? Write an implementation of Iterator (it may help to extend AbstractIterator) that steps through your data instead. Then you can make use of helpful utilities like these with that iterator. None of this will cause huge amounts of data to be loaded eagerly into memory -- instead, records are read from your source only as the iterator is advanced.
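A dependency-free sketch of that pattern, reading lines from a file lazily with plain java.util.Iterator (with Guava you could extend AbstractIterator and implement only computeNext(); the line-per-record format here is an assumption for illustration):

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

/** Lazily iterates the lines of a file; nothing beyond one look-ahead line
 *  (plus the reader's buffer) is held in memory at a time. */
class LineIterator implements Iterator<String>, Closeable {
    private final BufferedReader reader;
    private String next;               // look-ahead line, null when exhausted

    LineIterator(Path file) throws IOException {
        reader = Files.newBufferedReader(file);
        next = reader.readLine();      // prime the look-ahead
    }

    @Override public boolean hasNext() {
        return next != null;
    }

    @Override public String next() {
        if (next == null) throw new NoSuchElementException();
        String current = next;
        try {
            next = reader.readLine(); // fetch the following record lazily
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return current;
    }

    @Override public void close() throws IOException {
        reader.close();
    }
}
```

The processing loop then pulls records one at a time, so memory use stays constant no matter how large the input is.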

If you're dealing with that much data, you may want to consider using a database.

Back it up with a database and do lazy loading on the items.

An ORM framework may be in order. It depends on your usage. It may be pretty straightforward, or the worst of your nightmares; it's hard to tell from what you've described.

I'm an optimist, and I think that using an ORM framework (such as Hibernate) would solve your problem in about 3 to 5 days.

Is there sorting/processing going on while the data is being read into the collection? Where is it being read from?

If it's already being read from disk, would it be possible to simply batch-process it directly from disk, instead of reading it into a list completely and then iterating? How inter-dependent is the data?

I would also question why you need to load all of the data into memory to process it. Typically, you should be able to do the processing as it is being loaded and then use the result. That would keep the actual data out of memory.
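Since the question mentions aggregate operations, here is a sketch of that approach: accumulate the aggregates as each record streams by, so no record needs to stay resident (the record type is simplified to an int for illustration):

```java
/** Computes count, sum, and mean in one streaming pass; each record is
 *  examined once and discarded, so memory use is constant regardless of
 *  input size. Many aggregates (min, max, variance, group totals) can be
 *  folded in the same way. */
class RunningAggregate {
    long count;
    long sum;

    void accept(int value) {
        count++;
        sum += value;
    }

    double mean() {
        return count == 0 ? 0.0 : (double) sum / count;
    }
}
```

Feeding it from any record source, e.g. `source.forEach(agg::accept)`, gives the final aggregates without ever materializing the full list. This only works when the business logic can be expressed as a fold over the records; aggregates that need multiple passes or cross-record lookups would still need the spooling or database approaches above.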
