
Read large amount of data from file in Java

I've got a text file that contains 1 000 002 numbers in the following format:

123 456
1 2 3 4 5 6 .... 999999 100000

Now I need to read that data and assign the very first two numbers to int variables and all the rest (1 000 000 numbers) to an int[] array.

It's not a hard task, but it's horribly slow.

My first attempt was java.util.Scanner:

 Scanner stdin = new Scanner(new File("./path"));
 int n = stdin.nextInt();
 int t = stdin.nextInt();
 int array[] = new int[n];

 for (int i = 0; i < n; i++) {
     array[i] = stdin.nextInt();
 }

It works as expected, but it takes about 7500 ms to execute. I need to fetch that data within a few hundred milliseconds.

Then I tried java.io.BufferedReader:

Using BufferedReader.readLine() and String.split() I got the same results in about 1700 ms, but that's still too long.
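The question does not show the readLine()/split() code, so the following is only a sketch of what that approach might look like. It assumes the first line holds n and t, that n is the number of values on the second line (as in the Scanner code above), and that values are separated by whitespace. The class name SplitLineReader and the Data holder are hypothetical names used for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class SplitLineReader {

    /** Hypothetical holder for the two header values and the parsed numbers. */
    static final class Data {
        final int n;
        final int t;
        final int[] array;

        Data(int n, int t, int[] array) {
            this.n = n;
            this.t = t;
            this.array = array;
        }
    }

    /** Reads "n t" from the first line and n integers from the second line. */
    static Data read(Reader source) throws IOException {
        try (BufferedReader br = new BufferedReader(source)) {
            String[] header = br.readLine().trim().split("\\s+");
            int n = Integer.parseInt(header[0]);
            int t = Integer.parseInt(header[1]);

            // The whole second line is split at once, which allocates one String per number.
            String[] tokens = br.readLine().trim().split("\\s+");
            int[] array = new int[n];
            for (int i = 0; i < n; i++) {
                array[i] = Integer.parseInt(tokens[i]);
            }
            return new Data(n, t, array);
        }
    }

    public static void main(String[] args) throws IOException {
        // A tiny in-memory input stands in for the real file here.
        Data d = read(new StringReader("3 456\n10 20 30\n"));
        System.out.println(d.n + " " + d.t + " " + d.array[2]); // prints "3 456 30"
    }
}
```

For the real file, a FileReader for the path would be passed instead of the StringReader. The per-number String allocation is a likely reason this approach stays well above the character-by-character parsers shown later.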

How can I read that amount of data in less than 1 second? The final result should be equal to:

int n = 123;
int t = 456;
int array[] = { 1, 2, 3, 4, ..., 999999, 100000 };

Following trashgod's answer:

The StreamTokenizer solution is faster (takes about 1400 ms), but it's still too slow:

StreamTokenizer st = new StreamTokenizer(new FileReader("./test_grz"));
st.nextToken();
int n = (int) st.nval;

st.nextToken();
int t = (int) st.nval;

int array[] = new int[n];

for (int i = 0; st.nextToken() != StreamTokenizer.TT_EOF; i++) {
    array[i] = (int) st.nval;
}

PS. There is no need for validation. I'm 100% sure that the data in the ./test_grz file is correct.

Thanks for every answer, but I've already found a method that meets my criteria:

BufferedInputStream bis = new BufferedInputStream(new FileInputStream("./path"));
int n = readInt(bis);
int t = readInt(bis);
int array[] = new int[n];
for (int i = 0; i < n; i++) {
    array[i] = readInt(bis);
}

private static int readInt(InputStream in) throws IOException {
    int ret = 0;
    boolean dig = false;

    for (int c = 0; (c = in.read()) != -1; ) {
        if (c >= '0' && c <= '9') {
            dig = true;
            ret = ret * 10 + c - '0';
        } else if (dig) break;
    }

    return ret;
}

It takes only about 300 ms to read 1 million integers!

StreamTokenizer may be faster, as suggested here.

You can reduce the time for the StreamTokenizer result by using a BufferedReader:

Reader r = null;
try {
    r = new BufferedReader(new FileReader(file));
    final StreamTokenizer st = new StreamTokenizer(r);
    ...
} finally {
    if (r != null)
        r.close();
}

Also, don't forget to close your files, as I've shown here.
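On Java 7 or later, the same close-on-exit behaviour can be had with try-with-resources. The sketch below is one possible way to fill in the elided body above, not the answerer's own code; it assumes the input starts with n and t followed by n integers, and the class name TokenizerReader is made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;

class TokenizerReader {

    /** Reads n, then t (discarded in this sketch), then n integers. */
    static int[] read(Reader source) throws IOException {
        // The reader is closed automatically when the try block exits, even on an exception.
        try (Reader r = new BufferedReader(source)) {
            StreamTokenizer st = new StreamTokenizer(r);

            st.nextToken();
            int n = (int) st.nval;

            st.nextToken();
            int t = (int) st.nval; // read to advance past it; unused here

            int[] array = new int[n];
            // Bounding the loop by n avoids overrunning the array if extra tokens follow.
            for (int i = 0; i < n && st.nextToken() != StreamTokenizer.TT_EOF; i++) {
                array[i] = (int) st.nval;
            }
            return array;
        }
    }

    public static void main(String[] args) throws IOException {
        int[] array = read(new StringReader("3 456\n7 8 9\n"));
        System.out.println(array.length + " " + array[0] + " " + array[2]); // prints "3 7 9"
    }
}
```

StreamTokenizer stores numbers in a double (nval), which represents every int value exactly, so the cast back to int is safe for this data.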

You can also shave off some more time by using a custom tokenizer written just for this purpose:

public class CustomTokenizer {

    private final Reader r;

    public CustomTokenizer(final Reader r) {
        this.r = r;
    }

    public int nextInt() throws IOException {
        int i = r.read();
        if (i == -1)
            throw new EOFException();

        char c = (char) i;

        // Skip any whitespace
        while (c == ' ' || c == '\n' || c == '\r') {
            i = r.read();
            if (i == -1)
                throw new EOFException();
            c = (char) i;
        }

        int result = (c - '0');
        while ((i = r.read()) >= 0) {
            c = (char) i;
            if (c == ' ' || c == '\n' || c == '\r')
                break;
            result = result * 10 + (c - '0');
        }

        return result;
    }

}

Remember to use a BufferedReader for this. This custom tokenizer assumes the input data is always completely valid and contains only spaces, newlines, and digits.

If you read these results a lot and they do not change much, you should probably cache the array and keep track of the file's last-modified time. Then, if the file has not changed, just use the cached copy of the array, which will speed things up significantly. For example:

public class ArrayRetriever {

    private File inputFile;
    private long lastModified;
    private int[] lastResult;

    public ArrayRetriever(File file) {
        this.inputFile = file;
    }

    public int[] getResult() {
        if (lastResult != null && inputFile.lastModified() == lastModified)
            return lastResult;

        lastModified = inputFile.lastModified();

        // do logic to actually read the file here

        lastResult = array; // the array variable from your examples
        return lastResult;
    }

}

How much memory do you have in the computer? You could be running into GC issues.

The best thing to do is to process the data one line at a time if possible. Don't load it into an array. Load what you need, process it, write it out, and continue.

This will reduce your memory footprint while still using the same amount of file I/O.
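As a rough illustration of processing without an array: since the file in the question keeps all the numbers on one long line, this sketch works one value at a time rather than one line at a time. The running sum is only a stand-in for whatever processing is actually needed, and the class name StreamingSum is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;

class StreamingSum {

    /** Sums every number in the input without storing the numbers anywhere. */
    static long sum(Reader source) throws IOException {
        try (Reader r = new BufferedReader(source)) {
            StreamTokenizer st = new StreamTokenizer(r);
            long total = 0;
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                // Only numeric tokens contribute; anything else is skipped.
                if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    total += (long) st.nval;
                }
            }
            return total;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(sum(new StringReader("1 2 3\n4\n"))); // prints "10"
    }
}
```

Only the tokenizer's internal buffer and one accumulator are kept in memory, regardless of how many numbers the input contains.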

If it's possible to reformat the input so that each integer is on a separate line (instead of one long line with one million integers), you should see much improved performance using Integer.parseInt(BufferedReader.readLine()), due to smarter buffering by line and not having to split the long string into a separate array of Strings.

Edit: I tested this and managed to read the output produced by seq 1 1000000 into an array of int in well under half a second, but of course this depends on the machine.
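The test code itself was not posted, so this is only a sketch of the one-integer-per-line idea. It assumes the number of lines is known in advance and that each line contains exactly one integer with no surrounding spaces; the class name LinePerIntReader is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class LinePerIntReader {

    /** Reads 'count' lines, each holding a single integer. */
    static int[] read(Reader source, int count) throws IOException {
        try (BufferedReader br = new BufferedReader(source)) {
            int[] array = new int[count];
            for (int i = 0; i < count; i++) {
                // readLine() strips the line terminator, so parseInt sees only digits.
                array[i] = Integer.parseInt(br.readLine());
            }
            return array;
        }
    }

    public static void main(String[] args) throws IOException {
        // Same shape as the output of "seq 1 3".
        int[] array = read(new StringReader("1\n2\n3\n"), 3);
        System.out.println(array[0] + " " + array[1] + " " + array[2]); // prints "1 2 3"
    }
}
```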

I would extend FilterReader and parse the string as it is read in the read() method. Have a getNextNumber method return the numbers. Code left as an exercise for the reader.
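One possible reading of that suggestion is sketched below. It is an interpretation, not the answerer's code: parsing happens inside getNextNumber() on top of the inherited read(), the input is assumed to hold only non-negative integers, and -1 is used as an end-of-input marker. The class name NumberReader is hypothetical.

```java
import java.io.BufferedReader;
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class NumberReader extends FilterReader {

    NumberReader(Reader in) {
        super(in);
    }

    /** Returns the next non-negative integer, or -1 when the input is exhausted. */
    int getNextNumber() throws IOException {
        int c;
        // Skip everything that is not a digit (spaces, line breaks).
        while ((c = read()) != -1 && (c < '0' || c > '9')) {
            // nothing to do, just advancing
        }
        if (c == -1) {
            return -1;
        }
        int value = 0;
        // Accumulate digits until a non-digit or end of input is reached.
        do {
            value = value * 10 + (c - '0');
            c = read();
        } while (c >= '0' && c <= '9');
        return value;
    }

    public static void main(String[] args) throws IOException {
        // Wrapping a BufferedReader keeps the single-character read() calls cheap.
        try (NumberReader nr = new NumberReader(new BufferedReader(new StringReader("12 345\n6")))) {
            System.out.println(nr.getNextNumber()); // prints "12"
            System.out.println(nr.getNextNumber()); // prints "345"
            System.out.println(nr.getNextNumber()); // prints "6"
            System.out.println(nr.getNextNumber()); // prints "-1"
        }
    }
}
```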

Using a StreamTokenizer on a BufferedReader will already give you quite good performance. You shouldn't need to write your own readInt() function.

Here is the code I used to do some local performance testing:

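import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.StreamTokenizer;
import java.util.Scanner;

/**
 * Created by zhenhua.xu on 11/27/16.
 */
public class MyReader {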

private static final String FILE_NAME = "./1m_numbers.txt";
private static final int n = 1000000;

public static void main(String[] args) {
    try {
        readByScanner();
        readByStreamTokenizer();
        readByStreamTokenizerOnBufferedReader();
        readByBufferedInputStream();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public static void readByScanner() throws Exception {
    long startTime = System.currentTimeMillis();

    Scanner stdin = new Scanner(new File(FILE_NAME));
    int array[] = new int[n];
    for (int i = 0; i < n; i++) {
        array[i] = stdin.nextInt();
    }

    long endTime = System.currentTimeMillis();
    System.out.println(String.format("Total time by Scanner: %d ms", endTime - startTime));
}

public static void readByStreamTokenizer() throws Exception {
    long startTime = System.currentTimeMillis();

    StreamTokenizer st = new StreamTokenizer(new FileReader(FILE_NAME));
    int array[] = new int[n];

    for (int i = 0; st.nextToken() != StreamTokenizer.TT_EOF; i++) {
        array[i] = (int) st.nval;
    }

    long endTime = System.currentTimeMillis();
    System.out.println(String.format("Total time by StreamTokenizer: %d ms", endTime - startTime));
}

public static void readByStreamTokenizerOnBufferedReader() throws Exception {
    long startTime = System.currentTimeMillis();

    StreamTokenizer st = new StreamTokenizer(new BufferedReader(new FileReader(FILE_NAME)));
    int array[] = new int[n];

    for (int i = 0; st.nextToken() != StreamTokenizer.TT_EOF; i++) {
        array[i] = (int) st.nval;
    }

    long endTime = System.currentTimeMillis();
    System.out.println(String.format("Total time by StreamTokenizer with BufferedReader: %d ms", endTime - startTime));
}

public static void readByBufferedInputStream() throws Exception {
    long startTime = System.currentTimeMillis();

    BufferedInputStream bis = new BufferedInputStream(new FileInputStream(FILE_NAME));
    int array[] = new int[n];
    for (int i = 0; i < n; i++) {
        array[i] = readInt(bis);
    }

    long endTime = System.currentTimeMillis();
    System.out.println(String.format("Total time with BufferedInputStream: %d ms", endTime - startTime));
}

private static int readInt(InputStream in) throws IOException {
    int ret = 0;
    boolean dig = false;

    for (int c = 0; (c = in.read()) != -1; ) {
        if (c >= '0' && c <= '9') {
            dig = true;
            ret = ret * 10 + c - '0';
        } else if (dig) break;
    }

    return ret;
}
}

Results I got:

  • Total time by Scanner: 789 ms
  • Total time by StreamTokenizer: 226 ms
  • Total time by StreamTokenizer with BufferedReader: 80 ms
  • Total time by BufferedInputStream: 95 ms
