
Processing a large amount of data in Java

I am implementing a clustering algorithm on a large dataset. The dataset is in a text file and contains over 100 million records. Each record contains 3 numeric fields.

1,1503895,4
3,2207774,5
6,2590061,3
...

I need to keep all this data in memory if possible, since my clustering algorithm requires random access to the records in this file. Therefore I can't use any of the partition-and-merge approaches described in "Find duplicates in large file".

What are possible solutions to this problem? Can I use caching techniques like ehcache?

300 million ints shouldn't consume that much memory. Try instantiating an array of 300 million ints. A back-of-the-envelope calculation for a 64-bit machine gives about 1.2 GB: 300 million ints × 4 bytes each.
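A minimal sketch of that approach, assuming the record count is known up front and the file is named records.txt (both are placeholders): one primitive int array per field keeps the footprint near the 1.2 GB estimate and gives O(1) random access by index.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class RecordLoader {
    // Assumed record count; adjust to the actual file.
    static final int N = 100_000_000;

    public static void main(String[] args) throws IOException {
        // Three parallel primitive arrays, one per field:
        // 3 fields * 4 bytes * 100M records ~= 1.2 GB (run with e.g. -Xmx2g).
        int[] field1 = new int[N];
        int[] field2 = new int[N];
        int[] field3 = new int[N];

        int count = 0;
        try (BufferedReader in = new BufferedReader(new FileReader("records.txt"))) {
            String line;
            while ((line = in.readLine()) != null && count < N) {
                String[] parts = line.split(",");
                field1[count] = Integer.parseInt(parts[0]);
                field2[count] = Integer.parseInt(parts[1]);
                field3[count] = Integer.parseInt(parts[2]);
                count++;
            }
        }

        // Random access to record i is now a plain array lookup:
        int i = 42;
        System.out.println(field1[i] + "," + field2[i] + "," + field3[i]);
    }
}

Note that primitive arrays matter here: an ArrayList<Integer> would store boxed Integer objects, roughly quadrupling the footprint through object headers and references.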
