What is the fastest way to get the dimensions of a CSV file in Java

Question

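My usual process when I need to get the dimensions of a CSV file is as follows:

To get the number of rows:

I use a while loop to read every line and count each successful read. The drawback is that it takes time to read the whole file just to count how many rows it has.

Then, to get the number of columns: I use String[] temp = lineOfText.split(","); and take the size of temp.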

Is there a smarter way? Something like:
file1 = read.csv;
xDimention = file1.xDimention;
yDimention = file1.yDimention;

Answer 1

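I guess it depends on how regular the structure is, and whether you need an exact answer.

I could imagine looking at the first few rows (or randomly skipping through the file) and then dividing the file size by the average row size to get a rough row count.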
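The estimate described above could be sketched roughly as follows. This is only an illustration: the class name RowCountEstimator, the sample size of 100 lines, UTF-8 encoding and a one-byte line terminator are all assumptions, and the result is only as good as the sample is representative.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class RowCountEstimator {

    // Estimates the row count as fileSize / average size of the sampled lines.
    // lineTerminatorBytes is 1 for "\n" files and 2 for "\r\n" files (assumption).
    static long estimateRows(long fileSizeBytes, List<String> sampleLines, int lineTerminatorBytes) {
        if (sampleLines.isEmpty()) {
            return 0;
        }
        long sampledBytes = 0;
        for (String line : sampleLines) {
            sampledBytes += line.getBytes(StandardCharsets.UTF_8).length + lineTerminatorBytes;
        }
        double averageLineBytes = (double) sampledBytes / sampleLines.size();
        return Math.round(fileSizeBytes / averageLineBytes);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java RowCountEstimator <file.csv>");
            return;
        }
        Path file = Paths.get(args[0]);
        List<String> sample = new ArrayList<>();
        // Read only the first 100 lines instead of the whole file.
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while (sample.size() < 100 && (line = reader.readLine()) != null) {
                sample.add(line);
            }
        }
        System.out.println("Estimated rows: " + estimateRows(Files.size(file), sample, 1));
    }
}
```

Because only the sample is read, the cost does not grow with the file size, but rows with very different lengths or multi-line quoted values will throw the estimate off.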

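If you control how these files are written, you could tag them or add a metadata file alongside each one containing the row count.

Strictly speaking, the way you are splitting the line does not cover all possible cases. "hello, world", 4, 5 should be read as 3 columns, not 4.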
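To make the problem concrete, here is a small demo of what a plain split(",") reports. The class and method names are made up for illustration. It also shows a second pitfall of String.split: trailing empty fields are dropped unless a negative limit is passed.

```java
class NaiveSplitDemo {

    // The approach from the question: split on every comma and take the array length.
    static int naiveColumnCount(String line) {
        return line.split(",").length;
    }

    public static void main(String[] args) {
        // The quoted comma is treated as a separator, so this prints 4 instead of 3.
        System.out.println(naiveColumnCount("\"hello, world\", 4, 5"));

        // Trailing empty strings are removed by split, so this prints 2 instead of 4.
        System.out.println(naiveColumnCount("a,b,,"));

        // A negative limit keeps trailing empty fields, so this prints 4.
        System.out.println("a,b,,".split(",", -1).length);
    }
}
```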

Answer 2

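Your approach won't work with multi-line values (you'll get an invalid number of rows) or with quoted values that happen to contain the delimiter (you'll get an invalid number of columns).

You should use a CSV parser such as the one provided by univocity-parsers.

Using the uniVocity CSV parser, the fastest way to determine the dimensions is with the following code. It parses a 150MB file and gives its dimensions in 1.2 seconds: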

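import com.univocity.parsers.common.ParsingContext;
import com.univocity.parsers.common.processor.AbstractRowProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;

// Let's create our own RowProcessor to analyze the rows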
static class CsvDimension extends AbstractRowProcessor {

    int lastColumn = -1;
    long rowCount = 0;

    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        rowCount++;
        if (lastColumn < row.length) {
            lastColumn = row.length;
        }
    }
}

public static void main(String... args) throws FileNotFoundException {
    // let's measure the time roughly
    long start = System.currentTimeMillis();

    //Creates an instance of our own custom RowProcessor, defined above.
    CsvDimension myDimensionProcessor = new CsvDimension();

    CsvParserSettings settings = new CsvParserSettings();

    //This tells the parser that no row should have more than 2,000,000 columns
    settings.setMaxColumns(2000000);

    //Here you can select the column indexes you are interested in reading.
    //The parser will return values for the columns you selected, in the order you defined
    //By selecting no indexes here, no String objects will be created
    settings.selectIndexes(/*nothing here*/);

    //When you select indexes, the columns are reordered so they come in the order you defined.
    //By disabling column reordering, you will get the original row, with nulls in the columns you didn't select
    settings.setColumnReorderingEnabled(false);

    //We instruct the parser to send all rows parsed to your custom RowProcessor. 
    settings.setRowProcessor(myDimensionProcessor);

    //Finally, we create a parser
    CsvParser parser = new CsvParser(settings);

    //And parse! All rows are sent to your custom RowProcessor (CsvDimension)
    //I'm using a 150MB CSV file with 1.3 million rows. 
    parser.parse(new FileReader(new File("c:/tmp/worldcitiespop.txt")));

    //Nothing else to do. The parser closes the input and does everything for you safely. Let's just get the results:
    System.out.println("Columns: " + myDimensionProcessor.lastColumn);
    System.out.println("Rows: " + myDimensionProcessor.rowCount);
    System.out.println("Time taken: " + (System.currentTimeMillis() - start) + " ms");

}

The output will be:

Columns: 7
Rows: 3173959
Time taken: 1279 ms

Disclosure: I am the author of this library. It is open source and free (Apache V2.0 license).

Answer 3

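To find the number of rows you have to read the whole file. There is nothing you can do about that. However, your method of finding the number of columns is inefficient. Instead of split, just count how many times "," appears in the line. You might also add a special condition here for fields put in quotes, as mentioned by @Vlad.

The String.split method creates an array of strings as its result and splits using a regexp, which is not very efficient.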
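A minimal sketch of counting separators instead of splitting, including the special condition for quoted fields, might look like this. The class and method names are hypothetical, it assumes a comma separator and double-quote quoting, and it only looks at a single line, so it does not help with values that span several lines.

```java
class QuoteAwareColumnCounter {

    // Counts columns in one line by counting commas that are outside double quotes.
    // An escaped quote ("") toggles the flag twice, so it leaves the state unchanged.
    // An empty line is reported as one (empty) column.
    static int countColumns(String line) {
        int columns = 1;
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (ch == '"') {
                inQuotes = !inQuotes;
            } else if (ch == ',' && !inQuotes) {
                columns++;
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(countColumns("\"hello, world\", 4, 5")); // prints 3
        System.out.println(countColumns("a,b,,"));                  // prints 4
    }
}
```

No array and no substring objects are created, which is the saving this answer is after.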

Answer 4

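IMO, what you are doing is an acceptable way to do it. But here are a couple of ways you could make it faster:

Rather than reading lines, which creates a new String object for each line, just use String.indexOf to find the bounds of the lines
Rather than using line.split, again use indexOf to count the number of commas
Multithreading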
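The first two suggestions could be sketched like this, assuming the file content is already available as one String (for very large files you would scan a buffer instead). The names are made up for illustration, and this version ignores quoting entirely, so quoted commas and multi-line values are miscounted.

```java
class IndexOfDimensions {

    // Counts occurrences of target in text[from, to) using indexOf only.
    static int countOccurrences(String text, char target, int from, int to) {
        int count = 0;
        int pos = text.indexOf(target, from);
        while (pos != -1 && pos < to) {
            count++;
            pos = text.indexOf(target, pos + 1);
        }
        return count;
    }

    // Rows = number of '\n', plus one if the last row has no trailing newline.
    static int rowCount(String content) {
        if (content.isEmpty()) {
            return 0;
        }
        int newlines = countOccurrences(content, '\n', 0, content.length());
        return content.endsWith("\n") ? newlines : newlines + 1;
    }

    // Columns = commas in the first line + 1 (assumes every row has the same width).
    static int columnCount(String content) {
        if (content.isEmpty()) {
            return 0;
        }
        int firstLineEnd = content.indexOf('\n');
        if (firstLineEnd == -1) {
            firstLineEnd = content.length();
        }
        return countOccurrences(content, ',', 0, firstLineEnd) + 1;
    }

    public static void main(String[] args) {
        String content = "a,b,c\n1,2,3\n";
        System.out.println(rowCount(content) + " x " + columnCount(content)); // prints 2 x 3
    }
}
```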

Answer 5

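I guess these are the options, depending on how you use the data:

Store the dimensions of the CSV file when writing it (in the first row or in an additional file)
Use a more efficient way to count lines, maybe http://docs.oracle.com/javase/6/docs/api/java/io/LineNumberReader.html
Instead of creating a fixed-size array (assuming that is what you need the line count for), use an array list; depending on the file size, this may or may not be more efficient.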
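As an illustration of the first option, here is a rough sketch that puts the dimensions in the first line when the file is written, so a reader only needs that one line. The "#dimensions=ROWSxCOLUMNS" format and all names are invented for this example, not an existing convention, and the writer does no quoting, so it assumes values contain no commas or newlines.

```java
import java.util.Arrays;
import java.util.List;

class DimensionHeader {

    // Hypothetical convention: the first line is "#dimensions=<rows>x<columns>".
    static final String PREFIX = "#dimensions=";

    // Builds the CSV text with the dimension line in front of the data rows.
    static String write(List<String[]> rows) {
        int columns = 0;
        for (String[] row : rows) {
            columns = Math.max(columns, row.length);
        }
        StringBuilder out = new StringBuilder();
        out.append(PREFIX).append(rows.size()).append('x').append(columns).append('\n');
        for (String[] row : rows) {
            out.append(String.join(",", row)).append('\n');
        }
        return out.toString();
    }

    // Parses the first line back into {rows, columns}.
    static int[] readDimensions(String firstLine) {
        if (!firstLine.startsWith(PREFIX)) {
            throw new IllegalArgumentException("No dimension header: " + firstLine);
        }
        String[] parts = firstLine.substring(PREFIX.length()).split("x");
        return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
    }

    public static void main(String[] args) {
        String csv = write(Arrays.asList(new String[] {"a", "b"}, new String[] {"1", "2"}));
        String firstLine = csv.substring(0, csv.indexOf('\n'));
        System.out.println(Arrays.toString(readDimensions(firstLine))); // prints [2, 2]
    }
}
```

This only works if you control the writer, and any other tool reading the file has to know to skip the extra first line.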

Answer 6

I found this short but interesting solution here: https://stackoverflow.com/a/5342096/4082824

LineNumberReader lnr = new LineNumberReader(new FileReader(new File("File1")));
lnr.skip(Long.MAX_VALUE);
System.out.println(lnr.getLineNumber() + 1); //Add 1 because line index starts at 0
lnr.close();

Answer 7

My solution is simple and correctly processes CSV with multi-line cells or quoted values.

For example, we have this CSV file:

1,"""2""","""111,222""","""234;222""","""""","1
2
3"
2,"""2""","""111,222""","""234;222""","""""","2
3"
3,"""5""","""1112""","""10;2""","""""","1
2"

My solution snippet is:

import java.io.*;

public class CsvDimension {

    public void parse(Reader reader) throws IOException {
        long cells = 0;
        int lines = 0;
        int c;
        boolean quoted = false;
        while ((c = reader.read()) != -1) {
            // A double quote opens or closes a quoted value; "" toggles twice
            if (c == '"') {
                 quoted = !quoted;
            }
            if (!quoted) {
                if (c == '\n') {
                    lines++;
                    cells++;
                }
                if (c == ',') {
                    cells++;
                }
            }
        }
        // Guard against division by zero when no line terminator was seen
        System.out.printf("lines : %d\n cells %d\n cols: %d\n", lines, cells, lines == 0 ? 0 : cells / lines);
        reader.close();
    }

    public static void main(String args[]) throws IOException {
        new CsvDimension().parse(new BufferedReader(new FileReader(new File("test.csv"))));
    }
}
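One thing to watch with the snippet above: a row is only counted when its terminating newline is read, so a last row without a trailing newline is missed. A possible variant that handles this and returns the values instead of printing them is sketched below. The class name and the {rows, maxColumns} return shape are my own choices for illustration, not part of the original answer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class CsvDimensionVariant {

    // Returns {rows, maxColumns}. Commas and newlines inside double quotes are ignored.
    static long[] dimensions(Reader reader) throws IOException {
        long rows = 0;
        long maxColumns = 0;
        long separators = 0;        // unquoted commas seen in the current row
        boolean rowStarted = false; // true once the current row has any character
        boolean quoted = false;
        int c;
        while ((c = reader.read()) != -1) {
            if (c == '"') {
                quoted = !quoted;
            }
            if (!quoted && c == '\n') {
                rows++;
                maxColumns = Math.max(maxColumns, separators + 1);
                separators = 0;
                rowStarted = false;
            } else {
                rowStarted = true;
                if (!quoted && c == ',') {
                    separators++;
                }
            }
        }
        // Count a final row that is not terminated by a newline.
        if (rowStarted) {
            rows++;
            maxColumns = Math.max(maxColumns, separators + 1);
        }
        return new long[] { rows, maxColumns };
    }

    // Convenience overload for in-memory text.
    static long[] dimensions(String csv) {
        try (Reader reader = new StringReader(csv)) {
            return dimensions(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] d = dimensions("a,\"x\ny\",c\n1,2,3");
        System.out.println("rows: " + d[0] + ", cols: " + d[1]); // prints rows: 2, cols: 3
    }
}
```

For real files, wrap the FileReader in a BufferedReader as in the original snippet, since this reads one character at a time.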

7 answers

Answer 1: score 3, 2015-06-03 15:43:18
Answer 2: score 2, accepted, 2015-06-04 06:56:40
Answer 3: score 0, 2015-06-03 15:44:34
Answer 4: score 0, 2015-06-03 15:49:29
Answer 5: score 0, 2015-06-03 15:49:53
Answer 6: score 0, 2015-06-10 18:29:55
Answer 7: score 0, 2015-12-10 07:27:17
