How to handle strings efficiently in Java?

There is a compressed file. First I need to decompress it, then read its contents line by line and process each line by splitting out two fields, using one of them as the key and encrypting the other. Some of the code is as follows:

try (GZIPInputStream stream = new GZIPInputStream(new ByteArrayInputStream(event.getBody()));
     BufferedReader br = new BufferedReader(new InputStreamReader(stream))) {
    String line;
    StringBuilder builder = new StringBuilder();
    while ((line = br.readLine()) != null) {
        builder.append(line);
        this.handleLine(builder);
        builder.setLength(0);
        builder.trimToSize();
    }
} catch (Exception e) {
    // ignore
}
  1. Each compressed package has about three million rows, so handling strings efficiently in the loop is the key to the performance of the entire program.
  2. Is it correct to use StringBuilder like this?
  3. The format of each line of data is as follows: aaa|bbb|ccc|ddd|eee|fff|ggg|hhh

What I want to know is how to use String and StringBuilder correctly in a loop over this extremely large amount of data.

When handling many individual items in a loop, there are basically two possible sources of trouble related to memory management:

  1. keeping unnecessary per-item data in memory, thus creating a memory leak
  2. creating large amounts of memory churn by allocating too much memory and/or too many individual objects for each item you handle.

Running into #1 would mean that your total memory usage increases throughout the loop, which puts an upper limit on how many items you can handle.

Running into #2 would "only" cause more garbage collection pauses and not cause your application to fail (i.e. it would slow down, but still work).
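To make the two cases concrete, here is a minimal sketch. The class and method names are made up for illustration, and the character count is only a placeholder for the real per-line work. The first method keeps every line reachable (trouble source #1), while the second lets each line become garbage as soon as it has been handled.

```java
import java.util.ArrayList;
import java.util.List;

public class MemoryPatternsSketch {

    // Trouble source #1: every line stays reachable through the list,
    // so memory usage grows with the size of the input.
    static List<String> retainEverything(Iterable<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            kept.add(line);
        }
        return kept;
    }

    // Per-item processing: only the running result survives an iteration,
    // so each line can be collected once the loop moves on.
    static long processAndDiscard(Iterable<String> lines) {
        long totalChars = 0;
        for (String line : lines) {
            totalChars += line.length(); // placeholder for the real per-line work
        }
        return totalChars;
    }
}
```

The loop in the question already follows the second pattern, since nothing from a previous line is kept once handleLine returns (assuming handleLine itself does not store the data somewhere).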

If you actually need the StringBuilder (as indicated by your comment), then you should get rid of the trimToSize() call (as Stephen C correctly commented), because it basically forces the StringBuilder to re-allocate space for the content of line in each iteration (gaining you very, very little over simply re-creating the StringBuilder in each iteration).

The only drawback of removing that call is that the memory used by the StringBuilder will not shrink until the loop has finished.

As long as there are no extreme outliers in line length in that file, that is probably not a problem.
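A minimal sketch of the loop with the builder reused and trimToSize() removed might look like this. The handleLine here is a hypothetical stand-in that just counts '|' separators, the initial capacity of 256 is a guess, and a StringReader replaces the GZIP stream so the example runs on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReuseBuilderSketch {

    // Hypothetical stand-in for the question's handleLine(): counts '|' separators.
    static int handleLine(StringBuilder line) {
        int separators = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == '|') {
                separators++;
            }
        }
        return separators;
    }

    // Feeds every line through one reused StringBuilder and sums the results.
    static int processLines(BufferedReader br) {
        StringBuilder builder = new StringBuilder(256); // initial capacity is a guess
        int total = 0;
        try {
            String line;
            while ((line = br.readLine()) != null) {
                builder.setLength(0);   // clears the content, keeps the backing array
                builder.append(line);
                total += handleLine(builder);
                // No trimToSize(): calling it after clearing would shrink the
                // backing array and force a re-allocation on the next append.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        BufferedReader br = new BufferedReader(new StringReader("aaa|bbb|ccc\nddd|eee\n"));
        System.out.println(processLines(br)); // prints 3
    }
}
```

With this shape the builder grows to the length of the longest line seen so far and then stays there, which is the trade-off described above.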

As an additional side note: you mention that String.split is too inefficient for you. One possible source of that inefficiency is that it may need to re-compile the regular expression on every call. If you pre-compile the pattern outside of the loop using Pattern.compile and then call Pattern.split() inside the loop, that might already be much quicker.
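A sketch of the pre-compiled variant, assuming the '|' delimiter from the sample line in the question. The class and method names are made up for illustration. Note that some JDK implementations (OpenJDK among them) already special-case simple delimiters in String.split, so the gain is worth measuring rather than assuming.

```java
import java.util.regex.Pattern;

public class PrecompiledSplitSketch {

    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern PIPE = Pattern.compile("\\|");

    // Splits a whole line into all of its fields.
    static String[] splitAll(CharSequence line) {
        return PIPE.split(line);
    }

    // Splits into at most 'limit' parts; the last part keeps the unsplit remainder.
    static String[] splitHead(CharSequence line, int limit) {
        return PIPE.split(line, limit);
    }

    public static void main(String[] args) {
        String line = "aaa|bbb|ccc|ddd|eee|fff|ggg|hhh";
        System.out.println(splitAll(line).length);   // prints 8
        System.out.println(splitHead(line, 3)[2]);   // prints ccc|ddd|eee|fff|ggg|hhh
    }
}
```

Pattern.split accepts any CharSequence, so a StringBuilder can be passed in directly without calling toString() first. If only the first couple of fields are needed, the limit variant avoids creating a String for every remaining field.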
