
What data structure to use for big data

I have an Excel sheet with a million rows. Each row has 100 columns. Each row represents an instance of a class with 100 attributes, and the column values are the values of those attributes.

What data structure is optimal here for storing the million instances of data?

Thanks

It really depends on how you need to access the data and what you want to optimize for, e.g. space vs. speed.

  • If you want to optimize for space, you could simply serialize and compress the data, but that is likely useless if you need to read or manipulate it.
  • If you access by index, the simplest thing is an array of arrays.
  • If you instead use an array of objects, where each object holds your 100 attributes, you get a better way to structure your code (encapsulation!); see the sketch after this list.
  • If you need to query/search the data, it really depends on the kind of queries. You may want to have a look at BST (binary search tree) data structures...
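
A minimal sketch of those two in-memory layouts. The class name Record and its two spelled-out attributes are hypothetical stand-ins for your real 100-attribute class:

    // Sketch of the two layouts; Record and its attributes are hypothetical.
    public class LayoutSketch {

        // Option B: one object per row, with named, typed attributes.
        static class Record {
            double price;   // hypothetical attribute 1
            String label;   // hypothetical attribute 2
            // ... 98 more attributes
        }

        public static void main(String[] args) {
            // Option A: array of arrays, accessed by (row, column) index.
            Object[][] rows = new Object[1_000_000][];
            rows[0] = new Object[] {42.0, "first" /* , ... 98 more values */};
            System.out.println(rows[0][1]);       // column 1 of row 0

            // Option B: array of objects; better encapsulation, same data.
            Record[] records = new Record[1_000_000];
            records[0] = new Record();
            records[0].price = 42.0;
            records[0].label = "first";
            System.out.println(records[0].label);
        }
    }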

If you want to store all the data in memory, you can use one of the implementations of Table from Guava, typically ArrayTable for dense tables or HashBasedTable if most cells are expected to be empty. Otherwise, a database (probably with a cache layer such as Ehcache or Terracotta) would be a better bet.
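
For illustration, a small sketch of the Guava Table API; dimensions are shrunk from 1,000,000 × 100 so the demo stays light:

    import com.google.common.collect.ArrayTable;
    import com.google.common.collect.Table;

    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    public class GuavaTableSketch {
        public static void main(String[] args) {
            // ArrayTable is backed by a dense 2-D array, so it needs the full
            // row/column key universe up front. Shrunk here for the demo;
            // the real case would be 1,000,000 x 100.
            List<Integer> rowKeys = IntStream.range(0, 1_000).boxed().collect(Collectors.toList());
            List<Integer> colKeys = IntStream.range(0, 100).boxed().collect(Collectors.toList());

            Table<Integer, Integer, Double> table = ArrayTable.create(rowKeys, colKeys);
            table.put(42, 7, 3.14);               // row key 42, column key 7
            System.out.println(table.get(42, 7)); // 3.14

            // For sparse data, HashBasedTable.create() allocates cells lazily.
        }
    }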

One million rows with 100 values each, where each value uses 8 bytes of memory, is only 800 MB, which will easily fit into the memory of most PCs, especially 64-bit ones. Try to make the type of each column as compact as possible.

A more efficient way of storing the data is by column, i.e. one array per column, each with a primitive data type, as sketched below. I suspect you don't even need to do this.
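
A sketch of that column layout; the column names and types here are hypothetical, and the comments make the 800 MB arithmetic above concrete:

    // One primitive array per column instead of one object per row.
    // Pick the most compact type each column allows. With 100 double
    // columns this is the 800 MB case: 1,000,000 rows * 100 * 8 bytes.
    public class ColumnStoreSketch {
        static final int ROWS = 1_000_000;

        static double[] price    = new double[ROWS]; // 8 bytes per value
        static int[]    quantity = new int[ROWS];    // 4 bytes per value
        static short[]  year     = new short[ROWS];  // 2 bytes per value

        public static void main(String[] args) {
            price[0] = 9.99;
            quantity[0] = 3;
            year[0] = 2024;
            System.out.println(price[0] * quantity[0]); // row 0 total
        }
    }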

If you have many more rows, e.g. billions, you can use off-heap memory, i.e. memory-mapped files and direct memory. This can efficiently store more data than you have main memory while keeping your heap relatively small (e.g. hundreds of GB off-heap with 1 GB in heap).
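
A minimal sketch of mapping one column of one million doubles into off-heap memory with plain java.nio; the file name data.bin is a placeholder:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class MappedColumnSketch {
        public static void main(String[] args) throws IOException {
            // Map one column of 1,000,000 doubles (8 MB) from a file; the
            // data lives off-heap, so the Java heap stays small no matter
            // how many columns you map.
            long bytes = 1_000_000L * Double.BYTES;
            try (FileChannel ch = FileChannel.open(Path.of("data.bin"),
                    StandardOpenOption.CREATE,
                    StandardOpenOption.READ,
                    StandardOpenOption.WRITE)) {
                MappedByteBuffer column = ch.map(FileChannel.MapMode.READ_WRITE, 0, bytes);
                column.putDouble(42 * Double.BYTES, 3.14);                // write row 42
                System.out.println(column.getDouble(42 * Double.BYTES)); // read row 42
            }
        }
    }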

Your best bet is to store them in tables in an actual database, such as Postgres. These are optimized for exactly what you are describing!

For data like that, I prefer to use a MySQL database, because it is faster and can accommodate large files like this.

The best option is to use a database that can store large amounts of data and is fast enough for quick access, such as Oracle, MSSQL, MySQL, or any other database that is fast and can hold a lot of data.

If you really have a million rows or more with 100 values each, I doubt it will all fit into memory... or is there a special reason for wanting it there, for example poor performance using a database?

Since you want random access, I'd use a persistence provider like Hibernate and a database of your choice (for example MySQL).

But be aware that the way you use the persistence provider has a great impact on performance. For example, you should use batch inserts (which are incompatible with IDENTITY-generated ids); see the sketch below.
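
A hedged sketch of such a batched insert loop, assuming Hibernate 5+ with hibernate.jdbc.batch_size configured and a SEQUENCE-based id generator (with IDENTITY generation, Hibernate silently disables JDBC batching):

    import org.hibernate.Session;
    import org.hibernate.SessionFactory;
    import org.hibernate.Transaction;

    import java.util.List;

    public class BatchInsertSketch {
        static final int BATCH_SIZE = 50; // should match hibernate.jdbc.batch_size

        static void insertAll(SessionFactory factory, List<?> rows) {
            try (Session session = factory.openSession()) {
                Transaction tx = session.beginTransaction();
                for (int i = 0; i < rows.size(); i++) {
                    session.persist(rows.get(i));
                    if (i > 0 && i % BATCH_SIZE == 0) {
                        session.flush(); // push the current batch to the database
                        session.clear(); // detach entities to keep memory flat
                    }
                }
                tx.commit();
            }
        }
    }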
