
Java Reading big file java heap space

I have written this code:

try (BufferedReader file = new BufferedReader(new FileReader("C:\\Users\\User\\Desktop\\big50m.txt"))) {
    String line;
    StringTokenizer st;

    while ((line = file.readLine()) != null) {
        st = new StringTokenizer(line); // Separation of integers of the file line
        while (st.hasMoreTokens())
            numbers.add(Integer.parseInt(st.nextToken())); // Converting and adding to the list of numbers
    }
}
catch (Exception e) {
    System.out.println("Can't read the file...");
}

The big50m file contains 50,000,000 integers, and I get this runtime error:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
    at java.lang.StringBuffer.append(StringBuffer.java:367)
    at java.io.BufferedReader.readLine(BufferedReader.java:370)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at unsortedfilesapp.UnsortedFilesApp.main(UnsortedFilesApp.java:37)
C:\Users\User\AppData\Local\NetBeans\Cache\8.2\executor-snippets\run.xml:53: Java returned: 1
BUILD FAILED (total time: 5 seconds)

I think the problem is the String variable named line. Can you tell me how to fix it? I use StringTokenizer because I want fast reading.

Create a BufferedReader from the file and read() it char by char. Collect the digit chars into a String, then call Integer.parseInt() on it; skip any non-digit char and continue parsing at the next digit, and so on.
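The steps above can be sketched roughly as follows. This is only an illustration: the class name CharByCharParser and the IntConsumer callback are assumptions made for the example, not part of the original answer. Like the description, it treats every non-digit character as a separator, so a minus sign is ignored, and Integer.parseInt() throws a NumberFormatException if a digit run does not fit into an int.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.function.IntConsumer;

// Hypothetical helper: reads one char at a time, so memory use does not depend on line length.
class CharByCharParser {

    static void parse(Reader in, IntConsumer sink) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        StringBuilder digits = new StringBuilder(11); // reused for every number
        int c;
        while ((c = reader.read()) != -1) {
            if (c >= '0' && c <= '9') {
                digits.append((char) c);              // still inside a number
            } else if (digits.length() > 0) {
                sink.accept(Integer.parseInt(digits.toString()));
                digits.setLength(0);                  // start the next number
            }
        }
        if (digits.length() > 0) {                    // input ended without a trailing separator
            sink.accept(Integer.parseInt(digits.toString()));
        }
    }
}
```

It could be called as CharByCharParser.parse(new FileReader(path), numbers::add), or with any other callback that consumes each value as it is found.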

Here is a version that minimizes memory usage: no byte-to-char conversion and no String operations. Note that this version does not handle negative numbers.

    public static void main(final String[]a) {
        final Set<Integer> number = new HashSet<>();
        int v = 0;
        boolean use = false;
        int c;
        // Input stream avoid char conversion
        try(InputStream s = new FileInputStream("C:\\Users\\User\\Desktop\\big50m.txt")) {
            // No allocation in the loop
            do {
                if((c = s.read()) == -1) break;
                if(c>='0' && c<='9') { v = v * 10 + (c - '0'); use = true; continue; }
                if(use) number.add(v);
                use = false;
                v = 0;
            } while(true);
            if(use) number.add(v);
        } catch(final Exception e){ System.out.println("Can't read the file..."); }
    }
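Since the version above does not handle negative numbers, here is one possible way to extend it. It is a sketch under assumptions: the class name SignedByteParser and the IntConsumer callback are made up for the example, and a '-' is only treated as a sign when it directly precedes a digit. The stream is also wrapped in a BufferedInputStream, because a plain FileInputStream.read() is not buffered. Values that do not fit into an int wrap around silently.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.function.IntConsumer;

// Hypothetical variant of the byte-based loop that also accepts a leading '-'.
class SignedByteParser {

    static void parse(InputStream raw, IntConsumer sink) throws IOException {
        InputStream in = new BufferedInputStream(raw); // avoid one system call per byte
        int value = 0;
        boolean inNumber = false;  // true while digits of the current number are being read
        boolean negative = false;  // true if the byte just before the digits was '-'
        int c;
        while ((c = in.read()) != -1) {
            if (c >= '0' && c <= '9') {
                value = value * 10 + (c - '0');
                inNumber = true;
            } else {
                if (inNumber) {
                    sink.accept(negative ? -value : value);
                }
                negative = (c == '-'); // only counts if the next byte is a digit
                value = 0;
                inNumber = false;
            }
        }
        if (inNumber) {                // last number without trailing separator
            sink.accept(negative ? -value : value);
        }
    }
}
```

As in the original, nothing is allocated inside the loop; what happens to each value is left to the callback.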

The readLine() method reads a whole line at once, which consumes a lot of memory when the line is very long. This is highly inefficient and does not scale to an arbitrarily large file.

You can use a StreamTokenizer, like this:

StreamTokenizer tokenizer = new StreamTokenizer(new FileReader("bigfile.txt"));
tokenizer.parseNumbers(); // default behaviour
while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
    if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
        numbers.add((int)Math.round(tokenizer.nval));
    }
}

I have not tested this code but it gives you the general idea.

The provided snippet works when the program is run with -Xmx2048m (after some adjustments: numbers was declared as List<Integer> numbers = new ArrayList<>(50000000);).

Since all numbers are on a single line, the BufferedReader approach does not work or scale well: the complete file is read into memory. Therefore the streaming approach (e.g. the one from @whbogado) is indeed the way to go.

StreamTokenizer tokenizer = new StreamTokenizer(new FileReader("bigfile.txt"));
tokenizer.parseNumbers(); // default behaviour
while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
    if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
        numbers.add((int)Math.round(tokenizer.nval));
    }
}

Since you write that you are still getting a heap space error, I assume that the streaming itself is no longer the problem. Unfortunately, you are storing all values in a List, and I think that is the problem now. You say in a comment that you do not know the actual count of numbers. Hence you should avoid storing them in a list and process them in a streaming fashion here as well.
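What "some kind of streaming" could look like depends on what is done with the numbers afterwards, which the question does not say. As a purely illustrative sketch, assume only aggregate values (count, sum, minimum, maximum) are needed; the class StreamingStats and its fields are hypothetical names for this example. Each value is consumed as soon as it is parsed and nothing is kept in a list.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;

// Hypothetical example: aggregates are updated per token, so heap use stays constant.
class StreamingStats {
    long count;
    long sum;
    int min = Integer.MAX_VALUE;
    int max = Integer.MIN_VALUE;

    static StreamingStats of(Reader in) throws IOException {
        StreamTokenizer tokenizer = new StreamTokenizer(new BufferedReader(in));
        StreamingStats stats = new StreamingStats();
        while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
            if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
                int v = (int) tokenizer.nval; // nval is a double; the cast truncates
                stats.count++;
                stats.sum += v;
                stats.min = Math.min(stats.min, v);
                stats.max = Math.max(stats.max, v);
            }
        }
        return stats;
    }
}
```

If the values have to be sorted or looked up later, this approach is not enough on its own and the data must be kept somewhere, in memory or on disk.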

For all who are interested, here is my little test code (Java 8), which produces a test file containing USED_INT_VALUES integers. I limited it to 5,000,000 integers for now. As you can see when running it, memory usage increases steadily while reading through the file. The only place that holds that much memory is the numbers List.

Be aware that initializing an ArrayList with an initial capacity only allocates the backing array of references; it does not allocate the memory that the stored objects need, in your case the Integer objects.

import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.StreamTokenizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.logging.Level;
import java.util.logging.Logger;

public class TestBigFiles {

    public static void main(String args[]) throws IOException {
        heapStatistics("program start");
        final int USED_INT_VALUES = 5000000;
        File tempFile = File.createTempFile("testdata_big_50m", ".txt");
        System.out.println("using file " + tempFile.getAbsolutePath());
        tempFile.deleteOnExit();

        Random rand = new Random();
        FileWriter writer = new FileWriter(tempFile);
        rand.ints(USED_INT_VALUES).forEach(i -> {
            try {
                writer.write(i + " ");
            } catch (IOException ex) {
                Logger.getLogger(TestBigFiles.class.getName()).log(Level.SEVERE, null, ex);
            }
        });
        writer.close();
        heapStatistics("large file generated - size=" + tempFile.length() + "Bytes");
        List<Integer> numbers = new ArrayList<>(USED_INT_VALUES);

        heapStatistics("large array allocated (to avoid array copy)");

        int c = 0;
        try (FileReader fileReader = new FileReader(tempFile);) {
            StreamTokenizer tokenizer = new StreamTokenizer(fileReader);

            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
                    numbers.add((int) tokenizer.nval);
                    c++;
                    // report only after a number was actually added
                    if (c % 100000 == 0) {
                        heapStatistics("within loop count " + c);
                    }
                }
            }
        }

        heapStatistics("large file parsed, number list size is " + numbers.size());
    }

    private static void heapStatistics(String message) {
        int MEGABYTE = 1024 * 1024;
        //clean up unused stuff
        System.gc();
        Runtime runtime = Runtime.getRuntime();
        System.out.println("##### " + message + " #####");

        System.out.println("Used Memory:" + (runtime.totalMemory() - runtime.freeMemory()) / MEGABYTE + "MB"
                + " Free Memory:" + runtime.freeMemory() / MEGABYTE + "MB"
                + " Total Memory:" + runtime.totalMemory() / MEGABYTE + "MB"
                + " Max Memory:" + runtime.maxMemory() / MEGABYTE + "MB");
    }
}
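Regarding the note above about Integer objects: if all values really must be held in memory, one option is to store them as primitive int values instead of boxed Integer objects. The following is a minimal sketch, not a tested replacement for the List in the program above; GrowableIntArray is a hypothetical name. An int takes 4 bytes, so 50,000,000 values need about 200 MB in an int[], whereas each boxed Integer is a separate heap object that typically takes several times as much.

```java
import java.util.Arrays;

// Hypothetical minimal growable array of primitive ints (no boxing).
class GrowableIntArray {
    private int[] data;
    private int size;

    GrowableIntArray(int initialCapacity) {
        data = new int[Math.max(1, initialCapacity)];
    }

    void add(int value) {
        if (size == data.length) {
            // grow by roughly 50%; the old array becomes garbage after the copy
            data = Arrays.copyOf(data, data.length + (data.length >> 1) + 1);
        }
        data[size++] = value;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return data[index];
    }

    int size() {
        return size;
    }

    int[] toArray() {
        return Arrays.copyOf(data, size); // trimmed copy
    }
}
```

While the array grows, the old and the new array exist at the same time, so a generous initial capacity helps when the approximate count is known.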


 