Reading a huge single-line string from a text file

I have a large text file with no line breaks. It contains one long string (a single huge line of ASCII characters). So far everything works, because I can read the whole line into memory in Java. But I am wondering whether there could be a memory leak issue once the file grows very large, say 5 GB or more, and the program can no longer read the whole file into memory at once. In that case, what is the best way to read such a file? Can we break the huge line into two parts, or even multiple chunks?

Here's how I read the file:

   BufferedReader buf = new BufferedReader(new FileReader("input.txt"));
   String line;
   while((line = buf.readLine()) != null){

   }

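A single String can hold at most about 2 billion characters (its length is an int), and each character uses 2 bytes when stored as UTF-16, so even if you could read a 5 GB line, it would use about 10 GB of memory. (Since Java 9, strings containing only Latin-1 characters are stored at 1 byte per character, but the length limit still applies.)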
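The arithmetic above can be checked before attempting a whole-file read. This is a minimal sketch, assuming a pure-ASCII file (one char per byte) and UTF-16 storage at 2 bytes per char. The class and method names are illustrative, and real JVMs cap array sizes slightly below Integer.MAX_VALUE, so treat the bound as optimistic.

```java
final class StringLimitCheck {
    /** Could a file of this many bytes fit in one String, assuming one char per byte (ASCII)? */
    static boolean fitsInOneString(long fileSizeBytes) {
        // String length is an int, so Integer.MAX_VALUE is a hard upper bound.
        return fileSizeBytes <= Integer.MAX_VALUE;
    }

    /** Heap bytes needed for the characters alone when each char takes 2 bytes. */
    static long utf16HeapBytes(long charCount) {
        return charCount * 2L;
    }

    public static void main(String[] args) {
        long fiveGb = 5L * 1024 * 1024 * 1024;
        System.out.println(fitsInOneString(fiveGb)); // false
        System.out.println(utf16HeapBytes(fiveGb));  // 10737418240
    }
}
```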

I suggest you read the text in blocks.

Reader reader = new FileReader("input.txt");
try {
    char[] chars = new char[8192];
    for(int len; (len = reader.read(chars)) > 0;) {
        // process chars.
    }
} finally {
    reader.close();
}

This will use about 16 KB for the buffer (8192 chars at 2 bytes each), regardless of the size of the file.
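To make the "process chars" step concrete, here is a small sketch that counts how often one character occurs while reading in 8192-char blocks. The counting task, the class name ChunkedCount, and the method countChar are assumptions chosen for illustration. A StringReader stands in for the file so the example runs on its own.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class ChunkedCount {
    /** Counts occurrences of target, holding only one 8192-char block in memory at a time. */
    static long countChar(Reader reader, char target) throws IOException {
        char[] chars = new char[8192];
        long count = 0;
        for (int len; (len = reader.read(chars)) > 0; ) {
            // Only the first len entries are valid in this pass.
            for (int i = 0; i < len; i++) {
                if (chars[i] == target) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // For a real file, replace the StringReader with new FileReader("input.txt").
        try (Reader reader = new StringReader("a,b,c,,d")) {
            System.out.println(countChar(reader, ',')); // 4
        }
    }
}
```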

There won't be any kind of memory leak, since the JVM's garbage collector reclaims buffers that are no longer referenced. However, you will probably run out of heap space if you try to hold the whole file at once.

In cases like this, it is best to read and process the stream in manageable pieces: read in 64 MB or so, process it, and repeat.
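One complication with fixed-size pieces is that a logical record may straddle two of them. The sketch below carries the unfinished tail of one piece over to the next. It assumes the data contains some delimiter character, which the question does not state, and the names ChunkedSplitter and forEachToken are hypothetical. If no delimiter ever appears, the pending text grows without bound, so this only helps when records are reasonably short.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class ChunkedSplitter {
    /** Reads chunkSize chars at a time and hands each delimiter-separated token to handler. */
    static void forEachToken(Reader reader, char delimiter, int chunkSize,
                             Consumer<String> handler) throws IOException {
        char[] chunk = new char[chunkSize];
        StringBuilder pending = new StringBuilder(); // token text not yet terminated
        int len;
        while ((len = reader.read(chunk)) != -1) {
            int start = 0;
            for (int i = 0; i < len; i++) {
                if (chunk[i] == delimiter) {
                    pending.append(chunk, start, i - start);
                    handler.accept(pending.toString());
                    pending.setLength(0);
                    start = i + 1;
                }
            }
            // Keep the unterminated tail for the next chunk.
            pending.append(chunk, start, len - start);
        }
        if (pending.length() > 0) {
            handler.accept(pending.toString());
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> tokens = new ArrayList<>();
        // A tiny chunk size forces tokens to span chunk boundaries.
        forEachToken(new StringReader("alpha;beta;gamma"), ';', 4, tokens::add);
        System.out.println(tokens); // [alpha, beta, gamma]
    }
}
```

For a real file, the chunk size would be much larger (the 64 MB suggested above corresponds to about 32 million chars), and the StringReader would be replaced by a FileReader.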

You might also find it useful to add the -Xmx option to your java command (for example, -Xmx4g) to increase the maximum heap space available to the JVM.
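To see what limit the JVM is actually running with, the standard Runtime API can report it. This is a small sketch; the class name HeapInfo is illustrative, and the reported figure may be somewhat below the -Xmx value depending on the garbage collector in use.

```java
final class HeapInfo {
    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running it once with the default settings and once with a larger -Xmx value shows whether the option took effect.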

It's better to read the file in chunks and then process each chunk (or concatenate them, if the result still fits in memory), because reading a big file all at once will cause heap space issues.

An easy way to do it is shown below:

  // Assumes java.io.* is imported.
  try (InputStream is = new FileInputStream("input.txt")) {
      byte[] buffer = new byte[1024];
      int read;
      while ((read = is.read(buffer)) != -1) {
          // do whatever you need with buffer[0..read)
      }
  }

In addition to reading in chunks, you could also memory-map regions of the file using java.nio.MappedByteBuffer. Each mapped buffer is still limited to a maximum size of Integer.MAX_VALUE bytes, so a 5 GB file needs several mappings. This may be better than explicitly reading chunks if you will be making scattered accesses within a chunk.
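A rough sketch of mapping a file window by window follows. The class name MappedWindows, the method countByte, and the byte-counting task are assumptions made for illustration. The window size passed in must not exceed Integer.MAX_VALUE, and the demo uses a tiny temporary file so it can run anywhere.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedWindows {
    /** Counts occurrences of target by mapping the file one window at a time. */
    static long countByte(Path file, byte target, long windowSize) throws IOException {
        long count = 0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            for (long pos = 0; pos < size; pos += windowSize) {
                // The last window may be shorter than windowSize.
                long len = Math.min(windowSize, size - pos);
                MappedByteBuffer window = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (window.hasRemaining()) {
                    if (window.get() == target) {
                        count++;
                    }
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mapped-demo", ".txt");
        file.toFile().deleteOnExit();
        Files.write(file, "xxaxxaxa".getBytes(StandardCharsets.US_ASCII));
        // A 3-byte window forces several mappings over the 8-byte file.
        System.out.println(countByte(file, (byte) 'a', 3)); // 3
    }
}
```

For the scattered-access case mentioned above, window.get(int index) reads a byte at an arbitrary position within the mapped window without moving the buffer's position.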

To read chunks from one file and write them to another, this could be used:

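// Assumes java.io.* is imported.
try (Reader in = new FileReader("input.txt");
     Writer out = new FileWriter("output.txt")) {
    char[] buffer = new char[1024];
    int len;
    while ((len = in.read(buffer)) > 0) {
        // write only the chars actually read in this pass
        out.write(buffer, 0, len);
    }
}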

You won't run into memory leak issues, but you may run into heap space issues. To avoid them, use a buffer.

It all depends on how you are currently reading the line. It is possible to avoid all heap issues by using a buffer.

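// Assumes java.io.BufferedReader and java.io.IOException are imported.
public void readLongString(int size, BufferedReader in) throws IOException {
  char[] buffer = new char[size];
  int len;
  // The offset argument of read() indexes into the buffer, not the file, so it stays 0.
  while ((len = in.read(buffer, 0, size)) != -1) {
       // do stuff with buffer[0..len)
  }
}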
