Efficient way to read a matrix of doubles

What is a fast way to read in a clean matrix of doubles (the matrix has no missing elements or NAs)? Most entries are non-zero doubles; perhaps 30% are zero. The dimensions are around 1 million rows by 100 columns.

The function I am using is below. However, it is quite slow for matrices over 1 gigabyte.

How can I do this faster? Would any of the following help?

- Instead of saving as CSV and reading that, save in a binary format or some other format.
- Transpose the matrix in the data file, then read column by column instead of row by row as the function below does.
- Serialize the matrix as a Java object for re-reads.

 private static Vector<Vector<Double>> readTXTFile(String csvFileName, int skipRows) throws IOException {
     String line = null;
     BufferedReader stream = null;
     Vector<Vector<Double>> csvData = new Vector<Vector<Double>>();

     try {
         stream = new BufferedReader(new FileReader(csvFileName));
         int count = 0;
         while ((line = stream.readLine()) != null) {
             count += 1;
             if (count <= skipRows) {
                 continue;
             }
             String[] splitted = line.split(",");
             Vector<Double> dataLine = new Vector<Double>(splitted.length);
             for (String data : splitted) {
                 dataLine.add(Double.valueOf(data));
             }

             csvData.add(dataLine);
         }
     } finally {
         if (stream != null)
             stream.close();
     }

     return csvData;
 }
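
On the first idea in the list above: a raw binary file removes the text parsing step (`split` plus `Double.valueOf`) altogether, so it is a plausible candidate for re-reads, though the actual gain should be measured on the real data. Below is a minimal sketch, not a tuned implementation. The class name `BinaryMatrixIO` and the file layout (two `int` header fields followed by row-major doubles) are assumptions made for illustration.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: stores a rectangular double[][] as raw big-endian doubles.
final class BinaryMatrixIO {

    // Assumed layout: int rows, int cols, then rows * cols doubles in row-major order.
    static void write(Path file, double[][] matrix) throws IOException {
        int rows = matrix.length;
        int cols = rows == 0 ? 0 : matrix[0].length;
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file), 1 << 16))) {
            out.writeInt(rows);
            out.writeInt(cols);
            for (double[] row : matrix) {
                for (double value : row) {
                    out.writeDouble(value);
                }
            }
        }
    }

    // Reads the header first, so the array is allocated once at its final size.
    static double[][] read(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file), 1 << 16))) {
            int rows = in.readInt();
            int cols = in.readInt();
            double[][] matrix = new double[rows][cols];
            for (int r = 0; r < rows; r++) {
                for (int c = 0; c < cols; c++) {
                    matrix[r][c] = in.readDouble();
                }
            }
            return matrix;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("matrix", ".bin");
        write(file, new double[][] {{1.5, 0.0}, {-2.25, 3.0}});
        double[][] back = read(file);
        System.out.println(back.length + " x " + back[0].length);
        Files.delete(file);
    }
}
```

The idea would be to parse the CSV once, call `write`, and use `read` for every later load. At 8 bytes per value, a matrix of 1 million rows by 100 columns comes to roughly 800 MB on disk.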

I changed your code to get rid of all the Vector and Double object creation in favor of a fixed-size matrix (which does assume you know, or can calculate, the number of rows and columns in the file ahead of time).

I threw files of 500,000 lines at it and saw about a 25% improvement.

private static double[][] readTXTFile(String csvFileName, int skipRows) throws IOException {
    BufferedReader stream = null;
    // Hard-coded to match the test file; set these to the dimensions of your own data.
    int totalRows = 500000, totalColumns = 6;
    double[][] matrix = new double[totalRows][totalColumns];

    try {
        stream = new BufferedReader(new FileReader(csvFileName));
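        for (int currentRow = 0; currentRow < totalRows; currentRow++) {
            String line = stream.readLine();
            if (line == null) {
                break; // the file has fewer lines than totalRows
            }
            // currentRow is 0-based, so "<" skips exactly skipRows lines ("<=" would skip one extra).
            if (currentRow < skipRows) {
                continue;
            }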
            String[] splitted = line.split(",");
            for (int currentColumn = 0; currentColumn < totalColumns; currentColumn++) {
                matrix[currentRow][currentColumn] = Double.parseDouble(splitted[currentColumn]);
            }
        }
    } finally {
        if (stream != null) {
            stream.close();
        }
    }
    return matrix;
}
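
If the dimensions are not known ahead of time, one option is a cheap first pass that only counts lines and columns, followed by a second pass that fills the array. The sketch below illustrates that idea under a few assumptions: the class and method names (`CsvMatrixReader`, `scanDimensions`) are hypothetical, every data row is assumed to have the same number of columns as the first one, and, unlike the code above, skipped header rows are not stored as empty rows in the result.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: two-pass CSV reader that sizes the matrix before parsing.
final class CsvMatrixReader {

    // First pass: returns {rows, cols}, ignoring the first skipRows lines and empty lines.
    static int[] scanDimensions(Path file, int skipRows) throws IOException {
        int rows = 0;
        int cols = 0;
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            int lineNo = 0;
            while ((line = in.readLine()) != null) {
                if (lineNo++ < skipRows || line.isEmpty()) {
                    continue;
                }
                if (rows == 0) {
                    cols = line.split(",").length; // column count taken from the first data row
                }
                rows++;
            }
        }
        return new int[] {rows, cols};
    }

    // Second pass: parse straight into a primitive array of the right size.
    static double[][] read(Path file, int skipRows) throws IOException {
        int[] dims = scanDimensions(file, skipRows);
        double[][] matrix = new double[dims[0]][dims[1]];
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            int lineNo = 0;
            int row = 0;
            while ((line = in.readLine()) != null) {
                if (lineNo++ < skipRows || line.isEmpty()) {
                    continue;
                }
                String[] fields = line.split(",");
                for (int c = 0; c < dims[1]; c++) {
                    matrix[row][c] = Double.parseDouble(fields[c]);
                }
                row++;
            }
        }
        return matrix;
    }
}
```

The counting pass reads the whole file a second time, so it only pays off if that extra I/O costs less than growing a collection of boxed values; if the row count is already known from elsewhere, passing it in directly avoids the extra pass.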
