
Processing a large amount of data in Java

I am implementing a clustering algorithm on a large dataset. The dataset is in a text file and contains over 100 million records. Each record contains 3 numeric fields.

1,1503895,4
3,2207774,5
6,2590061,3
...

I need to keep all this data in memory if possible, since my clustering algorithm requires random access to the records in this file. Therefore I can't use any of the partition-and-merge approaches described in "Find duplicates in large file".

What are possible solutions to this problem? Can I use caching techniques like ehcache?

300 million ints shouldn't consume that much memory. Try instantiating an array of 300 million ints. A back-of-the-envelope calculation for a 64-bit machine gives about 1.2 GB: 300 million ints × 4 bytes each.
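A minimal sketch of that approach, assuming the record count is known up front and the file is named records.txt (both are placeholders): one primitive int array per field keeps the footprint near the 1.2 GB estimate and gives O(1) random access by index.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class RecordLoader {
    // Assumed record count; adjust to the actual file.
    static final int N = 100_000_000;

    public static void main(String[] args) throws IOException {
        // Three parallel primitive arrays, one per field:
        // 3 fields * 4 bytes * 100M records ~= 1.2 GB (run with e.g. -Xmx2g).
        int[] field1 = new int[N];
        int[] field2 = new int[N];
        int[] field3 = new int[N];

        int count = 0;
        try (BufferedReader in = new BufferedReader(new FileReader("records.txt"))) {
            String line;
            while ((line = in.readLine()) != null && count < N) {
                String[] parts = line.split(",");
                field1[count] = Integer.parseInt(parts[0]);
                field2[count] = Integer.parseInt(parts[1]);
                field3[count] = Integer.parseInt(parts[2]);
                count++;
            }
        }

        // Random access to record i is now a plain array lookup:
        int i = 42;
        System.out.println(field1[i] + "," + field2[i] + "," + field3[i]);
    }
}

Note that primitive arrays matter here: an ArrayList<Integer> would store boxed Integer objects, roughly quadrupling the footprint through object headers and references.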
