Java String Memory Performance Improvement for Text File Parser
I'm trying to optimize my memory usage when dealing with many (millions of) strings. I have a file of ~1.5 million lines that I'm reading through that contains decimal numbers in columns. For example, a file may look like
16916576.643 4 -12312674.246 4 39.785 4 16916584.123 3 -5937726.325 3
36.794 3
16399226.418 6 -4129008.232 6 43.280 6 16399225.374 4 -1891751.787 4
39.885 4
12415561.671 9 -33057782.339 9 52.412 9 12415567.518 8 -25595925.487 8
49.950 8
15523362.628 5 -12597312.619 5 40.579 5 15523369.553 5 -9739990.371 5
42.003 5
12369614.129 8 -28797729.913 8 50.068 8 0.000 0.000
0.000
....
Currently I'm using String.split("\\s+") to separate these numbers, then calling Double.parseDouble() on each part of the resulting String[], which looks something like:
String[] data = line.split("\\s+");
double firstValue = Double.parseDouble(data[0]);
double secondValue = Double.parseDouble(data[1]);
double thirdValue = Double.parseDouble(data[2]);
This ends up creating a lot of String objects. I may also have whitespace at the beginning or end of the line, so I have to call trim() on the line before I split it, which creates yet another String object. The garbage collector disposes of these String objects, which results in slowdowns. Are there more memory-efficient constructs in Java to do this? I was thinking about using a char[] instead of a String, but I'm not sure whether there would be a substantial improvement from that.
If you are really certain that this is a severe bottleneck, you could always parse your string directly into Doubles.
import java.util.ArrayList;
import java.util.List;

// Keep track of my state.
private static class AsDoublesState {
    // null means no number waiting for add.
    Double d = null;
    // null means not seen '.' yet.
    Double div = null;
    // Is it negative.
    boolean neg = false;

    void consume(List<Double> doubles, char ch) {
        // Digit?
        if ('0' <= ch && ch <= '9') {
            double v = ch - '0';
            if (d == null) {
                d = v;
            } else {
                d = d * 10 + v;
            }
            // Count digits after the dot.
            if (div != null) {
                div *= 10;
            }
        } else if (ch == '.') {
            // Decimal point - start tracking how much to divide by.
            div = 1.0;
        } else if (ch == '-') {
            // Negate!
            neg = true;
        } else {
            // Everything else completes the number.
            if (d != null) {
                if (neg) {
                    d = -d;
                }
                if (div != null) {
                    doubles.add(d / div);
                } else {
                    doubles.add(d);
                }
                // Clear down.
                d = null;
                div = null;
                neg = false;
            }
        }
    }
}

private static List<Double> asDoubles(String s) {
    // Grow the list.
    List<Double> doubles = new ArrayList<>();
    // Track my state.
    AsDoublesState state = new AsDoublesState();
    for (int i = 0; i < s.length(); i++) {
        state.consume(doubles, s.charAt(i));
    }
    // Pretend a space on the end to flush any remaining number.
    state.consume(doubles, ' ');
    return doubles;
}

public void test() {
    String s = "16916576.643 4 -12312674.246 4 39.785 4 16916584.123 3 -5937726.325 3 36.794 3";
    List<Double> doubles = asDoubles(s);
    System.out.println(doubles);
}
This will break badly if given bad data. E.g. 123--56...392.86 would be a perfectly valid number to it, and 6.0221413e+23 would be two numbers.
Here's an improved State using AtomicDouble (available in Guava) to avoid creating all of those Double objects.
import java.util.List;
import com.google.common.util.concurrent.AtomicDouble;

// Keep track of my state.
private static class AsDoublesState {
    // Mutable double value.
    AtomicDouble d = new AtomicDouble();
    // Mutable double value.
    AtomicDouble div = new AtomicDouble();
    // Is there a number.
    boolean number = false;
    // Is it negative.
    boolean negative = false;

    void consume(List<Double> doubles, char ch) {
        // Digit?
        if ('0' <= ch && ch <= '9') {
            double v = ch - '0';
            d.set(d.get() * 10 + v);
            number = true;
            // Count digits after the dot (stays 0 until a '.' has been seen).
            div.set(div.get() * 10);
        } else if (ch == '.') {
            // Decimal point - start tracking how much to divide by.
            div.set(1.0);
        } else if (ch == '-') {
            // Negate!
            negative = true;
        } else {
            // Everything else completes the number.
            if (number) {
                double v = d.get();
                if (negative) {
                    v = -v;
                }
                if (div.get() != 0) {
                    v = v / div.get();
                }
                doubles.add(v);
                // Clear down.
                d.set(0);
                div.set(0);
                number = false;
                negative = false;
            }
        }
    }
}
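Note that adding to a List<Double> still boxes every value. One further variation is to keep the state in plain double fields and write the results into a reusable double[]. The following is only a minimal sketch of that idea: the class PrimitiveLineParser and its methods are hypothetical names, and like the code above it does no validation and does not understand exponents.

```java
import java.util.Arrays;

/** Sketch: parses whitespace-separated decimals into a reusable double[] without boxing. */
final class PrimitiveLineParser {
    private double[] values = new double[16]; // reused across lines, grown on demand
    private int count;

    /** Parses one line and returns how many values were found; read them with get(i). */
    int parse(CharSequence line) {
        count = 0;
        double mantissa = 0;      // all digits seen so far, ignoring the decimal point
        double divisor = 0;       // 0 until '.' is seen, then 10^(digits after the point)
        boolean inNumber = false;
        boolean negative = false;
        // Go one position past the end so a trailing number is flushed by a virtual space.
        for (int i = 0; i <= line.length(); i++) {
            char ch = i < line.length() ? line.charAt(i) : ' ';
            if (ch >= '0' && ch <= '9') {
                mantissa = mantissa * 10 + (ch - '0');
                divisor *= 10;
                inNumber = true;
            } else if (ch == '.') {
                divisor = 1;
            } else if (ch == '-') {
                negative = true;
            } else {
                // Anything else ends the current number, if there is one.
                if (inNumber) {
                    double v = divisor != 0 ? mantissa / divisor : mantissa;
                    append(negative ? -v : v);
                }
                mantissa = 0;
                divisor = 0;
                inNumber = false;
                negative = false;
            }
        }
        return count;
    }

    /** Returns the i-th value of the most recently parsed line. */
    double get(int i) {
        if (i < 0 || i >= count) {
            throw new IndexOutOfBoundsException("index " + i + ", count " + count);
        }
        return values[i];
    }

    private void append(double v) {
        if (count == values.length) {
            values = Arrays.copyOf(values, count * 2);
        }
        values[count++] = v;
    }
}
```

Because the parser accepts a CharSequence and skips whitespace itself, the line does not need to be trimmed or split first, and one parser instance can be reused for every line of the file.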
Try using Pattern and Matcher to split the string with a compiled regular expression:
double[][] out = new double[2][2];
String[] data = new String[2];
data[0] = "1 2";
data[1] = "3 2";
Pattern pat = Pattern.compile("\\s*(\\d+\\.?\\d*)?\\s+?(\\d+\\.?\\d*)?\\s*");
Matcher mat = pat.matcher(data[0]);
mat.find();
out[0][0] = Double.parseDouble(mat.group(1));
out[0][1] = Double.parseDouble(mat.group(2));
mat = pat.matcher(data[1]);
mat.find();
out[1][0] = Double.parseDouble(mat.group(1));
out[1][1] = Double.parseDouble(mat.group(2));
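For lines with a varying number of columns, and negative values as in the question's data, the same idea can be written as a find() loop over a pattern that is compiled once. This is a sketch under assumptions: RegexNumberScanner and parseInto are hypothetical names, and the pattern only covers an optional minus sign, digits and an optional fraction. It removes the need for trim() and the String[], but m.group() still allocates one short String per number.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexNumberScanner {
    // Compiled once and reused: optional sign, digits, optional fractional part.
    private static final Pattern NUMBER = Pattern.compile("-?\\d+(?:\\.\\d+)?");

    /** Stores the numbers found in line into out and returns how many were stored. */
    static int parseInto(CharSequence line, double[] out) {
        Matcher m = NUMBER.matcher(line);
        int n = 0;
        // Stop when the line has no more numbers or the output array is full.
        while (n < out.length && m.find()) {
            out[n++] = Double.parseDouble(m.group());
        }
        return n;
    }

    public static void main(String[] args) {
        double[] out = new double[6];
        int n = parseInto(" 12415561.671 9 -33057782.339 9 52.412 9 ", out);
        for (int i = 0; i < n; i++) {
            System.out.println(out[i]);
        }
    }
}
```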
We had a similar problem in our application, where a lot of strings were being created, and we did a few things that helped us fix the issue.
If you're on a 32-bit JVM, you can only address up to 4 GB (the theoretical limit), so move to 64-bit.
Profile your application. You need to do it step by step:
Start with VisualVM and measure how many String and Double objects are being created, along with their size, the time spent, and so on.
Use the Flyweight pattern and intern the string objects. The Guava library has Interner, and you can do the same for doubles. This avoids duplicates and caches the objects using weak references, e.g.:
Interner<String> interner = Interners.newWeakInterner();
String a = interner.intern(getStringFromCsv());
String b = interner.intern(getStringFromCsv());
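If Guava is not on the classpath, the flyweight idea itself can be sketched with a plain HashMap. This is an illustration only: StringPool and canonical are hypothetical names, and unlike a weak interner this pool holds strong references, so its entries are never garbage collected while the pool is alive.

```java
import java.util.HashMap;
import java.util.Map;

final class StringPool {
    private final Map<String, String> pool = new HashMap<>();

    /** Returns the first instance seen that equals s, so equal strings share one object. */
    String canonical(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    public static void main(String[] args) {
        StringPool pool = new StringPool();
        String a = pool.canonical(new String("39.785"));
        String b = pool.canonical(new String("39.785"));
        System.out.println(a == b); // both variables refer to the pooled instance
    }
}
```

Pooling only pays off when the same values repeat often; for mostly unique numbers it adds a map lookup without saving memory.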
You can also use Scanner to read doubles from the file. Your code will be cleaner, and since Scanner uses Double.valueOf it can also cache double values.
Here's the code:
File file = new File("double_file.txt");
Scanner scan = new Scanner(file);
while(scan.hasNextDouble()) {
System.out.println( scan.nextDouble() );
}
scan.close();
You can use this code and see whether there is any gain in performance.
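One detail worth checking when trying this: Scanner parses numbers according to a locale, so on a machine whose default locale uses a decimal comma, hasNextDouble() may reject values such as 39.785. A minimal sketch that pins the locale follows; ScannerSum and sum are hypothetical names, and it reads from any Readable so it can be tried without a file.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Scanner;

final class ScannerSum {
    /** Sums every double token in the input; stops at the first token that is not a double. */
    static double sum(Readable input) {
        double total = 0;
        try (Scanner scan = new Scanner(input)) {
            // Use '.' as the decimal separator regardless of the machine's default locale.
            scan.useLocale(Locale.US);
            while (scan.hasNextDouble()) {
                total += scan.nextDouble();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(new StringReader("1.5 2 -0.5\n4")));
    }
}
```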
You don't have to produce any garbage at all:
Use a BufferedInputStream in order to avoid char[] and String creation. There are no non-ASCII characters, so you can deal with bytes directly.
Provide a method such as double nextDouble(), which reads characters until the next double is assembled.
For line boundaries, look for \n (ignore \r, as it's just a needless addendum to \n; a lone \r used to be used as a line separator a long time ago).
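The points above can be sketched roughly as follows. This is an assumption-laden illustration, not a finished parser: ByteDoubleReader is a hypothetical name, it returns NaN at the end of the stream as its own convention, it treats \n and \r like any other separator rather than tracking lines, and it does no validation and no exponent handling.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ByteDoubleReader {
    private final InputStream in;

    ByteDoubleReader(InputStream in) {
        // Buffer the stream so that reading one byte at a time stays cheap.
        this.in = in instanceof BufferedInputStream ? in : new BufferedInputStream(in);
    }

    /** Reads the next number from the stream, or returns NaN if the stream ends first. */
    double nextDouble() {
        double mantissa = 0;      // all digits seen so far, ignoring the decimal point
        double divisor = 0;       // 0 until '.' is seen, then 10^(digits after the point)
        boolean negative = false;
        boolean inNumber = false;
        while (true) {
            int b;
            try {
                b = in.read();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            if (b >= '0' && b <= '9') {
                mantissa = mantissa * 10 + (b - '0');
                divisor *= 10;
                inNumber = true;
            } else if (b == '.') {
                divisor = 1;
            } else if (b == '-') {
                negative = true;
            } else if (inNumber) {
                // Space, '\n', '\r' or end of stream (-1) completes the number.
                double v = divisor != 0 ? mantissa / divisor : mantissa;
                return negative ? -v : v;
            } else if (b < 0) {
                return Double.NaN; // end of stream with no number pending
            } else {
                // Separator with no digits seen yet: drop any stray '-' or '.'.
                negative = false;
                divisor = 0;
            }
        }
    }

    public static void main(String[] args) {
        byte[] data = " 15523362.628 5 -12597312.619 5\r\n40.579".getBytes(StandardCharsets.US_ASCII);
        ByteDoubleReader reader = new ByteDoubleReader(new ByteArrayInputStream(data));
        for (double v = reader.nextDouble(); !Double.isNaN(v); v = reader.nextDouble()) {
            System.out.println(v);
        }
    }
}
```

No String, char[] or Double is created per value here; the only allocations are the reader and the stream's internal buffer.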