
Suitable Java data structure for parsing large data file

I have a rather large text file (about 4 million lines) that I'd like to parse, and I'm looking for advice on a suitable data structure in which to store the data. The file contains lines like the following:

Date        Time    Value
2011-11-30  09:00   10
2011-11-30  09:15   5
2011-12-01  12:42   14
2011-12-01  19:58   19
2011-12-01  02:03   12

I want to group the lines by date, so my initial thought was to use a TreeMap<String, List<String>> to map each date to the rest of its lines. But is a TreeMap of Lists a ridiculous thing to do? I suppose I could replace the String key with a date object (to avoid so many string comparisons), but it's the List as a value that I'm worried might be unsuitable.

I'm using a TreeMap because I want to iterate over the keys in date order.

There's nothing wrong with using a List as the value for a Map. All of those <> look ugly, but it's perfectly fine to put a generic class inside another generic class.

Instead of using a String as the key, it would probably be better to use java.util.Date, because the keys are dates. This lets the TreeMap sort the keys as real dates rather than as text. If you store the dates as Strings, they are compared character by character; that happens to give the right order for the yyyy-MM-dd format shown in the question, but it would not for a format such as dd-MM-yyyy.

Map<Date, List<String>> map = new TreeMap<Date, List<String>>();
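To illustrate how such a map might be filled, here is a minimal sketch. It assumes whitespace-separated columns and yyyy-MM-dd dates as in the sample from the question; the class name GroupByDate and the method group are made up for this example, and silently skipping lines whose first column is not a date (such as the header) is an assumption, not something the question specifies.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, named for this example only.
class GroupByDate {

    // Groups lines by their date column; the rest of each line is kept as a String.
    static Map<Date, List<String>> group(Iterable<String> lines) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        format.setLenient(false);
        Map<Date, List<String>> byDate = new TreeMap<Date, List<String>>();
        for (String line : lines) {
            // Split once, on the first run of whitespace: [date, rest of line].
            String[] parts = line.trim().split("\\s+", 2);
            if (parts.length < 2) {
                continue; // blank or single-column line
            }
            Date date;
            try {
                date = format.parse(parts[0]);
            } catch (ParseException e) {
                continue; // assumption: non-date lines such as the header are skipped
            }
            List<String> rest = byDate.get(date);
            if (rest == null) {
                rest = new ArrayList<String>();
                byDate.put(date, rest);
            }
            rest.add(parts[1]);
        }
        return byDate;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "Date        Time    Value",
                "2011-12-01  12:42   14",
                "2011-11-30  09:00   10",
                "2011-11-30  09:15   5");
        // The TreeMap iterates its keys in ascending date order.
        for (Map.Entry<Date, List<String>> entry : group(sample).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

For a real 4-million-line file, the lines would more likely come from a BufferedReader than from an in-memory list; the grouping loop stays the same.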

Is a TreeMap of Lists a ridiculous thing to do?

Conceptually no, but it is going to be very memory-inefficient (both because of the Map and because of the Lists). You're looking at an overhead of 200% or more, which may or may not be acceptable depending on how much memory you can spare.

For a more memory-efficient solution, create a class that has a field for every column (including a Date), put all the instances in a List, and sort it (ideally using quicksort) when you're done reading.
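A rough sketch of that approach follows. The names SortedRows, Row and parseAndSort are invented for this example, and the column layout is assumed to match the sample in the question. Note one deliberate substitution: the sketch calls Collections.sort, which for object lists is a stable merge sort rather than quicksort, so rows sharing a date keep their original file order.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.Date;
import java.util.List;

// Hypothetical names, chosen for this example only.
class SortedRows {

    // One object per data line, with one field per column.
    static class Row {
        final Date date;
        final String time;
        final int value;

        Row(Date date, String time, int value) {
            this.date = date;
            this.time = time;
            this.value = value;
        }
    }

    // Reads every line into a Row, then sorts the whole list by date once.
    static List<Row> parseAndSort(Iterable<String> lines) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        format.setLenient(false);
        List<Row> rows = new ArrayList<Row>();
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length != 3) {
                continue; // assumption: lines without exactly three columns are skipped
            }
            try {
                rows.add(new Row(format.parse(cols[0]), cols[1], Integer.parseInt(cols[2])));
            } catch (ParseException | NumberFormatException e) {
                // assumption: unparseable lines such as the header are skipped
            }
        }
        // Stable library sort (merge sort based), used here in place of quicksort.
        Collections.sort(rows, new Comparator<Row>() {
            @Override
            public int compare(Row a, Row b) {
                return a.date.compareTo(b.date);
            }
        });
        return rows;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "Date        Time    Value",
                "2011-12-01  12:42   14",
                "2011-11-30  09:00   10",
                "2011-12-01  02:03   12");
        for (Row row : parseAndSort(sample)) {
            System.out.println(row.date + " " + row.time + " " + row.value);
        }
    }
}
```

Grouping by date then amounts to walking the sorted list and noticing when the date changes, with no per-date List objects or map entries to pay for.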

There is no objection to using Lists, though in your case a List<Integer> might be more appropriate as the value type of the Map.
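As a sketch of what that could look like, the snippet below keeps only the Value column, parsed to an int, in a Map<Date, List<Integer>>. It assumes three whitespace-separated columns and Java 8 or later (for computeIfAbsent); the class name ValuesByDate and the method collect are hypothetical, and dropping the Time column is an assumption about what the asker needs.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, named for this example only.
class ValuesByDate {

    // Maps each date to the integer values recorded on that date, in file order.
    static Map<Date, List<Integer>> collect(Iterable<String> lines) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        format.setLenient(false);
        Map<Date, List<Integer>> valuesByDate = new TreeMap<>();
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length != 3) {
                continue; // assumption: lines without exactly three columns are skipped
            }
            try {
                Date date = format.parse(cols[0]);
                int value = Integer.parseInt(cols[2]);
                // Create the list on first sight of a date, then append to it.
                valuesByDate.computeIfAbsent(date, d -> new ArrayList<>()).add(value);
            } catch (ParseException | NumberFormatException e) {
                // assumption: unparseable lines such as the header are skipped
            }
        }
        return valuesByDate;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "Date        Time    Value",
                "2011-11-30  09:00   10",
                "2011-11-30  09:15   5",
                "2011-12-01  12:42   14",
                "2011-12-01  19:58   19",
                "2011-12-01  02:03   12");
        // Values are listed per date, in ascending date order.
        System.out.println(collect(sample).values()); // prints [[10, 5], [14, 19, 12]]
    }
}
```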

