Reading a large amount of data from a file in Java

Question

I have a text file containing 1 000 002 numbers in the following format:

123 456
1 2 3 4 5 6 .... 999999 100000

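Now I need to read that data and assign the first two numbers to int variables, and all the rest (1 000 000 numbers) to an int[] array.

It's not a hard task, but it is horribly slow.

My first attempt was `java.util.Scanner`: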

 Scanner stdin = new Scanner(new File("./path"));
 int n = stdin.nextInt();
 int t = stdin.nextInt();
 int array[] = new int[n];

 for (int i = 0; i < n; i++) {
     array[i] = stdin.nextInt();
 }

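It works as expected, but it takes about 7500 ms to execute. I need to fetch that data in a few hundred milliseconds.

Then I tried `java.io.BufferedReader`:

Using BufferedReader.readLine() and String.split() I got the same results in about 1700 ms, but that is still too slow.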
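The question does not show the code for this attempt. A minimal sketch of what a readLine() plus split() reader might look like is given below; the class name `SplitReaderSketch`, the `Data` holder, and the assumption that the first header number gives the array length (as in the Scanner snippet) are all illustrative, not the asker's actual code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class SplitReaderSketch {

    /** Hypothetical holder mirroring n, t and array[] from the question. */
    static final class Data {
        final int n;
        final int t;
        final int[] array;

        Data(int n, int t, int[] array) {
            this.n = n;
            this.t = t;
            this.array = array;
        }
    }

    static Data read(Reader source) throws IOException {
        try (BufferedReader br = new BufferedReader(source)) {
            // First line: the two header numbers.
            String[] header = br.readLine().trim().split("\\s+");
            int n = Integer.parseInt(header[0]);
            int t = Integer.parseInt(header[1]);

            // Second line: all values on one long line, separated by whitespace.
            // split() materialises one String per number, which is where much of the time goes.
            String[] tokens = br.readLine().trim().split("\\s+");
            int[] array = new int[n];
            for (int i = 0; i < n; i++) {
                array[i] = Integer.parseInt(tokens[i]);
            }
            return new Data(n, t, array);
        }
    }

    public static void main(String[] args) throws IOException {
        // A tiny in-memory stand-in for the real file.
        Data d = read(new StringReader("3 456\n10 20 30\n"));
        System.out.println(d.n + " " + d.t + " " + d.array[2]);
    }
}
```

For a real file, a `java.io.FileReader` would be passed in place of the `StringReader`.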

How can I read that amount of data in less than 1 second? The final result should be equal to:

int n = 123;
int t = 456;
int array[] = { 1, 2, 3, 4, ..., 999999, 100000 };

Following trashgod's answer:

The StreamTokenizer solution is faster (it takes about 1400 ms), but it is still too slow:

StreamTokenizer st = new StreamTokenizer(new FileReader("./test_grz"));
st.nextToken();
int n = (int) st.nval;

st.nextToken();
int t = (int) st.nval;

int array[] = new int[n];

for (int i = 0; st.nextToken() != StreamTokenizer.TT_EOF; i++) {
    array[i] = (int) st.nval;
}

P.S. There is no need for validation. I am 100% sure that the data in the ./test_grz file is correct.

Answer 1

Thanks for every answer, but I have already found a method that meets my criteria:

BufferedInputStream bis = new BufferedInputStream(new FileInputStream("./path"));
int n = readInt(bis);
int t = readInt(bis);
int array[] = new int[n];
for (int i = 0; i < n; i++) {
    array[i] = readInt(bis);
}

private static int readInt(InputStream in) throws IOException {
    int ret = 0;
    boolean dig = false;

    for (int c = 0; (c = in.read()) != -1; ) {
        if (c >= '0' && c <= '9') {
            dig = true;
            ret = ret * 10 + c - '0';
        } else if (dig) break;
    }

    return ret;
}

Reading 1 million integers takes only about 300 ms!

Answer 2

StreamTokenizer may be faster, as suggested here.

Answer 3

You can reduce the time of the StreamTokenizer result by using a BufferedReader:

Reader r = null;
try {
    r = new BufferedReader(new FileReader(file));
    final StreamTokenizer st = new StreamTokenizer(r);
    ...
} finally {
    if (r != null)
        r.close();
}

Also, don't forget to close your files, as I have shown here.
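On Java 7 and later, the same close-on-exit behaviour can be written with try-with-resources instead of an explicit finally block. The sketch below is one possible way to fill in the elided part of the snippet above; the class name `TokenizerWithResources` and the choice to return only the array are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;

final class TokenizerWithResources {

    /** Reads n, t and then up to n integers; returns only the array for brevity. */
    static int[] read(Reader source) throws IOException {
        // try-with-resources closes the reader even if parsing throws.
        try (BufferedReader r = new BufferedReader(source)) {
            StreamTokenizer st = new StreamTokenizer(r);

            st.nextToken();
            int n = (int) st.nval; // first header value: array length
            st.nextToken();        // second header value (t), not needed here

            int[] array = new int[n];
            for (int i = 0; i < n && st.nextToken() != StreamTokenizer.TT_EOF; i++) {
                array[i] = (int) st.nval;
            }
            return array;
        }
    }
}
```

A call such as `TokenizerWithResources.read(new FileReader(file))` would then replace the try/finally version, with the reader closed automatically.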

You can also shave off some more time by using a custom tokenizer written just for your purpose:

import java.io.EOFException;
import java.io.IOException;
import java.io.Reader;

public class CustomTokenizer {

    private final Reader r;

    public CustomTokenizer(final Reader r) {
        this.r = r;
    }

    public int nextInt() throws IOException {
        int i = r.read();
        if (i == -1)
            throw new EOFException();

        char c = (char) i;

        // Skip any whitespace
        while (c == ' ' || c == '\n' || c == '\r') {
            i = r.read();
            if (i == -1)
                throw new EOFException();
            c = (char) i;
        }

        int result = (c - '0');
        while ((i = r.read()) >= 0) {
            c = (char) i;
            if (c == ' ' || c == '\n' || c == '\r')
                break;
            result = result * 10 + (c - '0');
        }

        return result;
    }

}

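Remember to use a BufferedReader for this. This custom tokenizer assumes the input data is always completely valid and contains only spaces, newlines, and digits.

If you read these results a lot and they do not change much, you should probably save the array and keep track of the file's last modified time. Then, if the file has not changed, just use the cached copy of the array, which will speed things up significantly. For example: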

public class ArrayRetriever {

    private File inputFile;
    private long lastModified;
    private int[] lastResult;

    public ArrayRetriever(File file) {
        this.inputFile = file;
    }

    public int[] getResult() {
        if (lastResult != null && inputFile.lastModified() == lastModified)
            return lastResult;

        lastModified = inputFile.lastModified();

        // do logic to actually read the file here

        lastResult = array; // the array variable from your examples
        return lastResult;
    }

}

Answer 4

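How much memory do you have in your computer? You could be running into GC issues.

The best thing to do, if possible, is to process the data one line at a time. Don't load it into an array. Load what you need, process it, write it out, and continue.

This will reduce your memory footprint while still using the same amount of file I/O.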
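As a rough illustration of processing without an array, the sketch below sums the values while reading them. Because the file in the question keeps all values on one long line, it works one token at a time rather than one line at a time. Summation is only a stand-in for whatever processing is actually needed, and the class name `StreamingSum` is made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;

final class StreamingSum {

    /** Skips the two header numbers, then adds up every remaining value. */
    static long sum(Reader source) throws IOException {
        try (BufferedReader r = new BufferedReader(source)) {
            StreamTokenizer st = new StreamTokenizer(r);
            st.nextToken(); // n, not needed when streaming
            st.nextToken(); // t

            long total = 0;
            // Each value is consumed and discarded, so memory use stays constant.
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                total += (long) st.nval;
            }
            return total;
        }
    }
}
```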

Answer 5

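If you can reformat the input so that each integer is on a separate line (instead of one long line with one million integers), you should see much improved performance using Integer.parseInt(BufferedReader.readLine()), thanks to smarter line-wise buffering and not having to split the long string into a separate array of Strings.

Edit: I tested this and managed to read the output produced by seq 1 1000000 into an int array in under half a second, but of course this depends on the machine.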
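The answer gives no code for this. A minimal sketch, assuming the number of lines is known in advance and every line holds exactly one valid integer, could look like the following; `LinePerIntReader` and its `count` parameter are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

final class LinePerIntReader {

    /** Reads exactly count lines, each holding one integer. */
    static int[] read(Reader source, int count) throws IOException {
        try (BufferedReader br = new BufferedReader(source)) {
            int[] array = new int[count];
            for (int i = 0; i < count; i++) {
                // No split() needed: each line is already a single token.
                array[i] = Integer.parseInt(br.readLine().trim());
            }
            return array;
        }
    }
}
```

With a file generated by `seq 1 1000000`, the call would be `LinePerIntReader.read(new FileReader(path), 1000000)`, where `path` is wherever that output was saved.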

Answer 6

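I would extend FilterReader and parse the string as it is read in the read() method. Have a getNextNumber method return the numbers. Code is left as an exercise for the reader.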
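One possible reading of this suggestion is sketched below. It deviates slightly from the description: the parsing happens inside getNextNumber(), which pulls characters through the inherited read(), rather than inside read() itself. The class name `NumberReader` and the convention of returning -1 at end of input are assumptions, and only non-negative integers are handled.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

final class NumberReader extends FilterReader {

    NumberReader(Reader in) {
        super(in);
    }

    /** Returns the next non-negative integer, or -1 at end of input. */
    int getNextNumber() throws IOException {
        int c = read();
        // Skip anything that is not a digit (spaces, newlines).
        while (c != -1 && (c < '0' || c > '9')) {
            c = read();
        }
        if (c == -1) {
            return -1;
        }
        // Accumulate digits until the first non-digit or end of input.
        int value = 0;
        while (c >= '0' && c <= '9') {
            value = value * 10 + (c - '0');
            c = read();
        }
        return value;
    }
}
```

Wrapping a BufferedReader, as in `new NumberReader(new BufferedReader(new FileReader(path)))`, keeps the per-character read() calls cheap.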

Answer 7

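Using a StreamTokenizer on a BufferedReader will already give you pretty good performance. You shouldn't need to write your own readInt() function.

Here is the code I used to do some local performance testing: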

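import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.StreamTokenizer;
import java.util.Scanner;

/**
 * Created by zhenhua.xu on 11/27/16.
 */
public class MyReader {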

private static final String FILE_NAME = "./1m_numbers.txt";
private static final int n = 1000000;

public static void main(String[] args) {
    try {
        readByScanner();
        readByStreamTokenizer();
        readByStreamTokenizerOnBufferedReader();
        readByBufferedInputStream();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public static void readByScanner() throws Exception {
    long startTime = System.currentTimeMillis();

    Scanner stdin = new Scanner(new File(FILE_NAME));
    int array[] = new int[n];
    for (int i = 0; i < n; i++) {
        array[i] = stdin.nextInt();
    }

    long endTime = System.currentTimeMillis();
    System.out.println(String.format("Total time by Scanner: %d ms", endTime - startTime));
}

public static void readByStreamTokenizer() throws Exception {
    long startTime = System.currentTimeMillis();

    StreamTokenizer st = new StreamTokenizer(new FileReader(FILE_NAME));
    int array[] = new int[n];

    for (int i = 0; st.nextToken() != StreamTokenizer.TT_EOF; i++) {
        array[i] = (int) st.nval;
    }

    long endTime = System.currentTimeMillis();
    System.out.println(String.format("Total time by StreamTokenizer: %d ms", endTime - startTime));
}

public static void readByStreamTokenizerOnBufferedReader() throws Exception {
    long startTime = System.currentTimeMillis();

    StreamTokenizer st = new StreamTokenizer(new BufferedReader(new FileReader(FILE_NAME)));
    int array[] = new int[n];

    for (int i = 0; st.nextToken() != StreamTokenizer.TT_EOF; i++) {
        array[i] = (int) st.nval;
    }

    long endTime = System.currentTimeMillis();
    System.out.println(String.format("Total time by StreamTokenizer with BufferedReader: %d ms", endTime - startTime));
}

public static void readByBufferedInputStream() throws Exception {
    long startTime = System.currentTimeMillis();

    BufferedInputStream bis = new BufferedInputStream(new FileInputStream(FILE_NAME));
    int array[] = new int[n];
    for (int i = 0; i < n; i++) {
        array[i] = readInt(bis);
    }

    long endTime = System.currentTimeMillis();
    System.out.println(String.format("Total time with BufferedInputStream: %d ms", endTime - startTime));
}

private static int readInt(InputStream in) throws IOException {
    int ret = 0;
    boolean dig = false;

    for (int c = 0; (c = in.read()) != -1; ) {
        if (c >= '0' && c <= '9') {
            dig = true;
            ret = ret * 10 + c - '0';
        } else if (dig) break;
    }

    return ret;
}

}

The results I got:

Total time by Scanner: 789 ms
Total time by StreamTokenizer: 226 ms
Total time by StreamTokenizer with BufferedReader: 80 ms
Total time with BufferedInputStream: 95 ms

7 answers

Answer 1: score 13, accepted, 2010-04-23 13:12:41
Answer 2: score 2, 2010-04-22 18:39:47
Answer 3: score 2, 2010-04-22 21:07:12
Answer 4: score 1, 2010-04-22 18:37:01
Answer 5: score 1, 2010-04-22 18:47:02
Answer 6: score 0, 2010-04-22 19:00:21
Answer 7: score 0, 2016-11-27 14:46:57
