
Exception while reading a very large file (> 300 MB)

My task is to open a large file in read/write mode, search for a portion of text in it by finding its start and end points, write that matched region to a new file, and then delete that portion from the original file.

I will repeat this process many times, so I thought it would be easiest to load the file into memory as a CharBuffer and search it with the Matcher class. But I am getting a heap-space exception while reading, even though I increased the heap to 900 MB by running java -Xms128m -Xmx900m readLargeFile. My code is:

FileChannel fc = new FileInputStream(fFile).getChannel();
CharBuffer chrBuff = Charset.forName("8859_1").newDecoder()
        .decode(fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size()));

For the above code, everyone suggested that it is a bad idea to load everything into memory, and that a 300 MB file will become 600 MB because of the character set decoding.

So that is my task; please suggest some efficient approaches. Note that my files can be even larger, and I have to do this in Java only.

Thanks in Advance...

You definitely do NOT want to load a 300MB file into a single large buffer with Java. The way you're doing things is supposed to be more efficient for large files than just using normal I/O, but when you run a Matcher against an entire file mapped into memory as you are, you can very easily exhaust memory.

First, your code memory-maps the file. This consumes 300 MB of your virtual address space as the file is mmap'ed into it, although it lives outside the heap. (Note that the 300 MB of virtual address space is tied up until the MappedByteBuffer is garbage collected; see below for discussion. The JavaDoc for map warns you about this.) Next, you have a ByteBuffer backed by this mmap'ed file. This should be fine, as it is just a "view" of the mmap'ed file and thus takes minimal extra memory: a small object in the heap with a "pointer" to a large object outside the heap. Finally, you decode this into a CharBuffer, which means you copy the 300 MB buffer into a 600 MB copy on the heap, because a char is 2 bytes.
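One way around that 600 MB copy is to decode with a small, fixed-size window instead of decoding the whole file at once. A minimal sketch of that idea (the class and method names here are made up for illustration; for demonstration it collects the decoded text, where a real program would search each window instead):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChunkedDecode {
    // Decode a file through a small, fixed-size window instead of one huge CharBuffer.
    static String decodeInChunks(Path file, int windowSize) throws IOException {
        StringBuilder out = new StringBuilder(); // demo only; a real search would scan each window
        CharsetDecoder decoder = Charset.forName("ISO-8859-1").newDecoder();
        try (FileChannel fc = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer bytes = ByteBuffer.allocate(windowSize);
            CharBuffer chars = CharBuffer.allocate(windowSize);
            while (fc.read(bytes) != -1) {
                bytes.flip();
                decoder.decode(bytes, chars, false); // decode what fits in the window
                chars.flip();
                out.append(chars);                   // replace with matching logic in practice
                chars.clear();
                bytes.compact();                     // keep any undecoded trailing bytes
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, "hello large file world".getBytes("ISO-8859-1"));
        System.out.println(decodeInChunks(tmp, 8)); // prints the file content
        Files.delete(tmp);
    }
}
```

Heap use stays proportional to the window size rather than the file size, which is the whole point.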

To respond to a comment (and after checking the JDK source code to be sure): when you call map() as the OP does, you do in fact map the entire file into memory. Looking at the OpenJDK 6 b14 Windows native code for sun.nio.ch.FileChannelImpl.c, it first calls CreateFileMapping, then calls MapViewOfFile. If you ask to map the whole file into memory, this method will do exactly as you ask. To quote MSDN:

Mapping a file makes the specified portion of a file visible in the address space of the calling process.

For files that are larger than the address space, you can only map a small portion of the file data at one time. When the first view is complete, you can unmap it and map a new view.

The way the OP is calling map, the "specified portion" of the file is the entire file. This won't contribute to heap exhaustion, but it can contribute to virtual address space exhaustion, which is still an OOM error. This can kill your application just as thoroughly as running out of heap.

Finally, when you make a Matcher, the Matcher potentially makes more copies of this 600 MB CharBuffer, depending on how you use it. Ouch. That's a lot of memory used by a small number of objects! Given a Matcher, every time you call toMatchResult(), you'll make a String copy of the entire CharBuffer. Also, every time you call replaceAll(), at best you will make a String copy of the entire CharBuffer. At worst you will make a StringBuffer that will slowly be expanded to the full size of the replaceAll result (applying a lot of memory pressure on the heap), and then make a String from that.

Thus, if you call replaceAll on a Matcher against a 300 MB file, and your match is found, then you'll first make a series of ever-larger StringBuffers until you get one that is 600 MB. Then you'll make a String copy of this StringBuffer. This can quickly and easily lead to heap exhaustion.
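If all you need from the replacement is to write it somewhere, one way to sidestep those full-size copies is to drive the Matcher with find() and start()/end() and stream the pieces straight to a Writer. A rough sketch (class and method names are invented; it assumes a literal replacement with no group references):

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StreamingReplace {
    // Write the result of a "replace all" directly to a Writer, so no String or
    // StringBuffer the size of the whole input is ever built on the heap.
    static void replaceTo(CharSequence input, Pattern pattern, String replacement, Writer out)
            throws IOException {
        Matcher m = pattern.matcher(input);
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start()); // copy the unmatched gap
            out.write(replacement);             // then the literal replacement
            last = m.end();
        }
        out.append(input, last, input.length()); // trailing tail
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        replaceTo("one fish two fish", Pattern.compile("fish"), "cat", out);
        System.out.println(out); // one cat two cat
    }
}
```

With a FileWriter as the target, the only large allocation left is whatever backs the input CharSequence itself.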

Here's the bottom line: Matcher s are not optimized for working on very large buffers. You can very easily, and without planning to, make a number of very large objects. I discovered this when doing something similar enough to what you're doing and encountering memory exhaustion, then looking at the source code for Matcher .

NOTE: There is no unmap call. Once you call map, the virtual address space outside the heap tied up by the MappedByteBuffer is stuck there until the MappedByteBuffer is garbage collected. As a result, you will be unable to perform certain operations on the file (delete, rename, ...) until the MappedByteBuffer is garbage collected. If you call map enough times on different files, but don't have sufficient memory pressure in the heap to force a garbage collection, you can run out of memory outside the heap. For a discussion, see Bug 4724038.
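If you need the mapping released promptly, the best the public API offers is dropping the reference and hinting the collector; nothing here guarantees an unmap, so treat this as a sketch of the limitation rather than a fix (the class and method are made up):

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MapRelease {
    // Map a file, use it, then drop the reference and hint the GC.
    // There is no explicit unmap in the public API (see Bug 4724038).
    static byte firstByte(Path file) throws Exception {
        MappedByteBuffer mbb;
        try (FileChannel fc = FileChannel.open(file, StandardOpenOption.READ)) {
            mbb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        }              // closing the channel does NOT unmap the buffer
        byte b = mbb.get(0);
        mbb = null;    // drop the only reference to the mapping ...
        System.gc();   // ... and *suggest* a collection; nothing is guaranteed
        return b;
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("map", ".bin");
        Files.write(tmp, new byte[] { 42 });
        System.out.println(firstByte(tmp)); // 42
        Files.delete(tmp); // may fail on Windows if the mapping is still live
    }
}
```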

As a result of all of the discussion above, if you will be using it to make a Matcher on large files, and you will be using replaceAll on the Matcher , then memory mapped I/O is probably not the way to go. It will simply create too many large objects on the heap as well as using up a lot of your virtual address space outside the heap. Under 32 bit Windows, you have only 2GB (or if you have changed settings, 3GB) of virtual address space for the JVM, and this will apply significant memory pressure both inside and outside the heap.

I apologize for the length of this answer, but I wanted to be thorough. If you think any part of the above is wrong, please comment and say so. I will not do retaliatory downvotes. I am very positive that all of the above is accurate, but if something is wrong, I want to know.

Does your search pattern ever match across more than one line? If not, the easiest solution is to read line by line :). Simple, really.

But if the search pattern matches multiple lines then you need to let us know because searching line by line will not work.
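For the single-line case, the approach can be sketched like this (class and method names are illustrative; it reports the 1-based number of the first matching line):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

public class LineSearch {
    // Find the (1-based) number of the first line matching the pattern,
    // reading one line at a time so heap use stays small.
    static int firstMatchingLine(Path file, Pattern pattern) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            String line;
            int lineNo = 0;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                if (pattern.matcher(line).find()) {
                    return lineNo;
                }
            }
        }
        return -1; // not found
    }
}
```

Only one line is ever resident, regardless of file size.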

Use a buffer and read the file in bulk, one chunk at a time. There is one trick: each time you read a new chunk into the buffer, keep an overlap of length l at the front, where l is the length of the search string, so that a match straddling two chunks is not missed:

l = length(substring)
while not eof do
    if find(buffer, substring) then return TRUE
    buffer[0 .. l-1] = last l chars of the previous buffer   (the overlap)
    buffer[l .. end] = read new chars from the file
end
return FALSE
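A possible Java rendering of that overlap trick (class and method names are made up; the rescan-per-chunk String copy keeps the sketch simple, at the cost of rechecking the small carry-over):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class OverlapSearch {
    // Scan a stream for `needle` using a fixed buffer, carrying the last
    // needle.length() - 1 chars between reads so a match spanning two chunks is found.
    static boolean contains(Reader in, String needle, int bufSize) throws IOException {
        int overlap = needle.length() - 1;
        char[] buf = new char[Math.max(bufSize, needle.length() * 2)];
        int filled = 0;
        int n;
        while ((n = in.read(buf, filled, buf.length - filled)) != -1) {
            filled += n;
            if (new String(buf, 0, filled).contains(needle)) {
                return true;
            }
            if (filled > overlap) {
                // keep only the last `overlap` chars as the carry-over
                System.arraycopy(buf, filled - overlap, buf, 0, overlap);
                filled = overlap;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(contains(new StringReader("xxxxxneedlexxxxx"), "needle", 4)); // true
    }
}
```

Memory use is bounded by the buffer size, never by the file size.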

Claims that FileChannel.map will load the entire file into memory are faulty, with reference to the MappedByteBuffer that FileChannel.map() returns. It is a direct byte buffer, so it does not exhaust your heap: direct byte buffers use the OS virtual memory subsystem to page data in and out of memory as required, allowing one to address much larger chunks of memory as if they were physical RAM. Then again, a single MappedByteBuffer will only work for files up to ~2 GB.

Try this:

FileChannel fc = new FileInputStream(fFile).getChannel();
MappedByteBuffer mbb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());

CharBuffer chrBuff = mbb.asCharBuffer();

It will not load the entire file into memory, and the chrBuff is only a view of the backing MappedByteBuffer, and not a copy.

I'm not sure how to handle the decoding, though.
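On the decoding: asCharBuffer() reinterprets each 2-byte pair as one UTF-16 char rather than performing an ISO-8859-1 decode, so a real decode still needs a Charset. Decoding only a small window of the mapped buffer keeps the char copy small; a sketch (the class name is invented):

```java
import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class DecodeWindow {
    // Decode only a window [start, start+len) of a byte buffer, e.g. a MappedByteBuffer,
    // so the char copy is the size of the window, not the whole file.
    static String decodeWindow(ByteBuffer buf, int start, int len) {
        ByteBuffer slice = buf.duplicate(); // independent position/limit, shared bytes
        slice.position(start);
        slice.limit(start + len);
        CharBuffer chars = Charset.forName("ISO-8859-1").decode(slice.slice());
        return chars.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        ByteBuffer bb = ByteBuffer.wrap("hello world".getBytes("ISO-8859-1"));
        System.out.println(decodeWindow(bb, 6, 5)); // world
    }
}
```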

In my case, adding -Djava.compiler=NONE after the classpath solved this problem.
