Huge String Table in Java

I have a question about storing a huge number of Strings in application memory. I need to load about 5 million lines from a file and keep them in memory. Each line is a URL of at most 255 chars, but most are ~50. From time to time I'll need to search for one of them. Is it possible to make this app run in ~1 GB of RAM?

Will

ArrayList<String> list = new ArrayList<String>();

work?

As far as I know, a String in Java is encoded in UTF-8, which gives me huge memory use. Is it possible to build such an array with Strings encoded in ANSI?

This is a console application run with these parameters:

java -Xmx1024M -Xms1024M -jar "PServer.jar" nogui

Some Java 6 JVMs accept -XX:+UseCompressedStrings, which stores Strings that contain only ASCII as a byte[] internally. From Java 9 onward, Compact Strings (-XX:+CompactStrings, enabled by default) stores Latin-1-only Strings as a byte[].

Having several GB of text in a List isn't a problem, but it can take a while to load from disk (many seconds).
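A minimal sketch of the loading step, so the load time can be measured directly. The class name UrlLoader, the file name urls.txt, and the choice of ISO-8859-1 (which decodes any byte without failing) are assumptions for illustration, not details from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class UrlLoader {
    // Reads every line of the file into a list. The list is pre-sized so its
    // backing array is not repeatedly grown and copied while loading.
    static List<String> load(Path file, int expectedLines) throws IOException {
        List<String> urls = new ArrayList<>(expectedLines);
        // Assumption: the file is ASCII or Latin-1 text, one URL per line.
        try (BufferedReader reader =
                 Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = reader.readLine()) != null) {
                urls.add(line);
            }
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        // "urls.txt" is a placeholder path.
        Path file = Paths.get(args.length > 0 ? args[0] : "urls.txt");
        long start = System.nanoTime();
        List<String> urls = load(file, 5_000_000);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(urls.size() + " lines loaded in " + millis + " ms");
    }
}
```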

If the average URL is 50 ASCII chars, with 32 bytes of overhead per String, 5 million entries could use about 400 MB, which isn't much for a modern PC or server.
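The arithmetic behind that estimate can be written out as a small sketch. The 32-byte overhead is the figure assumed in the answer above; the method name estimateBytes and the two-bytes-per-char comparison case are added here for illustration.

```java
class StringTableEstimate {
    // Rough heap estimate: character payload plus a fixed per-String overhead.
    static long estimateBytes(long entries, int avgChars,
                              int bytesPerChar, int overheadPerString) {
        return entries * ((long) avgChars * bytesPerChar + overheadPerString);
    }

    public static void main(String[] args) {
        long entries = 5_000_000L;
        // One byte per char: ASCII text stored compactly.
        long compact = estimateBytes(entries, 50, 1, 32);
        // Two bytes per char: the same text held as UTF-16 chars.
        long utf16 = estimateBytes(entries, 50, 2, 32);
        System.out.println("compact: " + compact / 1_000_000 + " MB"); // compact: 410 MB
        System.out.println("UTF-16:  " + utf16 / 1_000_000 + " MB");   // UTF-16:  660 MB
    }
}
```

The list itself adds one reference per entry (4 or 8 bytes each, depending on the JVM), so both cases stay under the 1 GB heap under these assumptions.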

A Java String is a full-blown object. This means that, apart from the characters of the string themselves, there is other information to store (an object header that includes a pointer to the object's class, a cached hash code, a reference to the backing array, and that array's own header and length). So an empty String already takes 45 bytes in memory (as you can see here). Now you just have to add the maximum length of your strings and do some simple calculations to get the maximum memory use of the list.

Anyway, I would suggest loading each string as a byte[] if you have memory issues. That way you control the encoding and can still search.
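One way the byte[] suggestion might look, as a sketch rather than a definitive implementation. The class name ByteUrlTable is hypothetical, and ISO-8859-1 is an assumed encoding: any character outside Latin-1 would be replaced by '?' when encoded. The rows are sorted once so each lookup is a binary search.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;

class ByteUrlTable {
    // Lexicographic order on unsigned byte values.
    static final Comparator<byte[]> ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    };

    private final byte[][] rows;

    // Takes a String[] for brevity; a real loader would encode each line as it
    // is read instead of holding all the Strings at once.
    ByteUrlTable(String[] urls) {
        rows = new byte[urls.length][];
        for (int i = 0; i < urls.length; i++) {
            rows[i] = encode(urls[i]);
        }
        Arrays.sort(rows, ORDER); // sort once so lookups can use binary search
    }

    // One byte per character for Latin-1 text.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    boolean contains(String url) {
        return Arrays.binarySearch(rows, encode(url), ORDER) >= 0;
    }

    public static void main(String[] args) {
        ByteUrlTable table = new ByteUrlTable(new String[] {
            "http://example.com/b", "http://example.com/a"
        });
        System.out.println(table.contains("http://example.com/a")); // true
        System.out.println(table.contains("http://example.com/c")); // false
    }
}
```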

Is there some reason you need to restrict it to 1G? If you want to search through them, you definitely don't want to swap to disk, but if the machine has more memory it makes sense to go higher than 1G.

If you have to search, use a SortedSet, not an ArrayList.
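A short sketch of the SortedSet suggestion using TreeSet, where a lookup takes O(log n) comparisons instead of the linear scan of ArrayList.contains. The class name UrlIndex, the helper withPrefix, and the sample URLs are made up for illustration.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

class UrlIndex {
    // Returns a view of the URLs that start with the given prefix, using the
    // set's ordering rather than scanning every entry.
    static SortedSet<String> withPrefix(SortedSet<String> urls, String prefix) {
        return urls.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        SortedSet<String> urls = new TreeSet<>(Arrays.asList(
            "http://example.com/b", "http://example.com/a", "http://example.org/"));
        System.out.println(urls.contains("http://example.com/a")); // true
        System.out.println(urls.contains("http://example.net/"));  // false
        System.out.println(withPrefix(urls, "http://example.com/").size()); // 2
    }
}
```

A TreeSet allocates a node object per entry, so with memory this tight a sorted ArrayList searched with Collections.binarySearch is a lower-overhead alternative that gives the same logarithmic lookups.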
