
What is the fastest way to get the dimensions of a CSV file in Java?

My usual procedure for getting the dimensions of a CSV file is as follows:

  1. Get the number of rows:

I use a while loop to read every line and increment a counter on each successful read. The downside is that it takes time to read the whole file just to count its rows.

  2. Then get the number of columns: I use String[] temp = lineOfText.split(","); and then take the length of temp.

Is there a smarter method? Something like:
file1 = read.csv;
xDimension = file1.xDimension;
yDimension = file1.yDimension;

I guess it depends on how regular the structure is, and whether you need an exact answer or not.

I could imagine looking at the first few rows (or randomly skipping through the file), and then dividing the file size by average row size to determine a rough row count.
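A minimal sketch of that estimate, assuming UTF-8 text, a single '\n' terminator per line, and no multi-line quoted values. The class and method names (CsvRowEstimate, estimateRows) are made up for illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CsvRowEstimate {

    // Hypothetical helper: estimates the row count from the first sampleRows lines.
    public static long estimateRows(Path file, int sampleRows) throws IOException {
        long fileSize = Files.size(file);
        long sampledBytes = 0;
        int sampled = 0;
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while (sampled < sampleRows && (line = in.readLine()) != null) {
                // + 1 assumes one '\n' byte per line terminator
                sampledBytes += line.getBytes(StandardCharsets.UTF_8).length + 1;
                sampled++;
            }
        }
        if (sampled == 0) {
            return 0; // empty file
        }
        double averageRowSize = (double) sampledBytes / sampled;
        return Math.round(fileSize / averageRowSize);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: CsvRowEstimate <file.csv>");
            return;
        }
        System.out.println("Estimated rows: " + estimateRows(Paths.get(args[0]), 100));
    }
}
```

The result is only as good as the sample: if row lengths vary a lot, or the first rows are not representative (a long header, for instance), the estimate can be far off.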

If you control how these files get written, you could potentially tag them or add a metadata file next to them containing row counts.

Strictly speaking, the way you're splitting the line doesn't cover all possible cases. "hello, world", 4, 5 should read as having 3 columns, not 4.
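The pitfall is easy to reproduce. The sketch below (SplitPitfall and naiveColumns are illustrative names) also shows a second surprise with the same approach: split(",") silently drops trailing empty strings unless a negative limit is passed.

```java
public class SplitPitfall {

    // Naive column count, as in the question: split on every comma.
    public static int naiveColumns(String line) {
        return line.split(",").length;
    }

    public static void main(String[] args) {
        // The quoted comma is treated as a separator: prints 4, not 3
        System.out.println(naiveColumns("\"hello, world\", 4, 5"));
        // split(",") also drops trailing empty strings: prints 2, not 4
        System.out.println(naiveColumns("a,b,,"));
        // A negative limit keeps trailing empty strings: prints 4
        System.out.println("a,b,,".split(",", -1).length);
    }
}
```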

Your approach won't work with multi-line values (you'll get the wrong number of rows) or with quoted values that happen to contain the delimiter (you'll get the wrong number of columns).

You should use a CSV parser such as the one provided by univocity-parsers.

Using the uniVocity CSV parser, the fastest way to determine the dimensions would be with the following code. It parses a 150MB file to give its dimensions in 1.2 seconds:

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;

import com.univocity.parsers.common.ParsingContext;
import com.univocity.parsers.common.processor.AbstractRowProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

// The name of this wrapper class is arbitrary
public class CsvDimensionExample {

    // Let's create our own RowProcessor to analyze the rows
    static class CsvDimension extends AbstractRowProcessor {

        int lastColumn = -1;
        long rowCount = 0;

        @Override
        public void rowProcessed(String[] row, ParsingContext context) {
            rowCount++;
            // Keep the length of the widest row seen so far
            if (lastColumn < row.length) {
                lastColumn = row.length;
            }
        }
    }

    public static void main(String... args) throws FileNotFoundException {
        // let's measure the time roughly
        long start = System.currentTimeMillis();

        //Creates an instance of our own custom RowProcessor, defined above.
        CsvDimension myDimensionProcessor = new CsvDimension();

        CsvParserSettings settings = new CsvParserSettings();

        //This tells the parser that no row should have more than 2,000,000 columns
        settings.setMaxColumns(2000000);

        //Here you can select the column indexes you are interested in reading.
        //The parser will return values for the columns you selected, in the order you defined
        //By selecting no indexes here, no String objects will be created
        settings.selectIndexes(/*nothing here*/);

        //When you select indexes, the columns are reordered so they come in the order you defined.
        //By disabling column reordering, you will get the original row, with nulls in the columns you didn't select
        settings.setColumnReorderingEnabled(false);

        //We instruct the parser to send all rows parsed to your custom RowProcessor.
        settings.setRowProcessor(myDimensionProcessor);

        //Finally, we create a parser
        CsvParser parser = new CsvParser(settings);

        //And parse! All rows are sent to your custom RowProcessor (CsvDimension)
        //I'm using a 150MB CSV file with 1.3 million rows.
        parser.parse(new FileReader(new File("c:/tmp/worldcitiespop.txt")));

        //Nothing else to do. The parser closes the input and does everything for you safely. Let's just get the results:
        System.out.println("Columns: " + myDimensionProcessor.lastColumn);
        System.out.println("Rows: " + myDimensionProcessor.rowCount);
        System.out.println("Time taken: " + (System.currentTimeMillis() - start) + " ms");
    }
}

The output will be:

Columns: 7
Rows: 3173959
Time taken: 1279 ms

Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

To find the number of rows you have to read the whole file; there is no way around that. However, your method of finding the number of columns is a bit inefficient. Instead of calling split, just count how many times "," appears in the line (the column count is that number plus one). You may also need special handling for fields enclosed in quotes, as mentioned by @Vlad.

The String.split method allocates an array of strings for the result and splits using a regular expression, which is not very efficient.

IMO, what you are doing is an acceptable way to do it. But here are some ways you could make it faster:

  1. Rather than reading lines, which creates a new String object for each line, just use String.indexOf to find the bounds of your lines
  2. Rather than using line.split, again use indexOf to count the number of commas
  3. Multithreading
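As a rough sketch of the second point, the column count of a single record can be obtained without split and without allocating any strings. Note that this version scans characters with charAt rather than calling indexOf repeatedly, so that it can also skip commas inside double quotes. CommaCount and countColumns are illustrative names, and the record is assumed to fit on one line:

```java
public class CommaCount {

    // Counts columns in one CSV record by counting commas outside double quotes.
    public static int countColumns(String line) {
        int columns = 1; // a record with no commas still has one column
        boolean quoted = false;
        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (ch == '"') {
                // An escaped quote ("") toggles twice, leaving the state unchanged
                quoted = !quoted;
            } else if (ch == ',' && !quoted) {
                columns++;
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(countColumns("\"hello, world\", 4, 5")); // prints 3
        System.out.println(countColumns("a,b,,"));                  // prints 4
    }
}
```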

I guess these are the options, depending on how you use the data:

  1. Store the dimensions of your CSV file when writing it (in the first row or in an additional file)
  2. Use a more efficient way to count lines - maybe http://docs.oracle.com/javase/6/docs/api/java/io/LineNumberReader.html
  3. Instead of creating arrays of a fixed size (assuming that's what you need the line count for), use array lists - this may or may not be more efficient, depending on the size of the file.
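The first option could look something like the sketch below. The ".dims" sidecar file and its "rows,columns" layout are a made-up convention for illustration, not a standard, and the writer does no quoting, so it assumes that values contain no commas, quotes, or line breaks.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CsvWithDimensions {

    // Hypothetical convention: "data.csv" gets a sibling "data.csv.dims" holding "rows,columns"
    public static Path sidecar(Path csv) {
        return csv.resolveSibling(csv.getFileName() + ".dims");
    }

    // Writes the rows as CSV and records the dimensions next to the file.
    public static void write(Path csv, List<String[]> rows) throws IOException {
        List<String> lines = new ArrayList<>();
        int columns = 0;
        for (String[] row : rows) {
            lines.add(String.join(",", row)); // no escaping: see the assumptions above
            columns = Math.max(columns, row.length);
        }
        Files.write(csv, lines, StandardCharsets.UTF_8);
        Files.write(sidecar(csv), (rows.size() + "," + columns).getBytes(StandardCharsets.UTF_8));
    }

    // Reads {rows, columns} back without touching the CSV file itself.
    public static long[] readDimensions(Path csv) throws IOException {
        String text = new String(Files.readAllBytes(sidecar(csv)), StandardCharsets.UTF_8).trim();
        String[] parts = text.split(",");
        return new long[] { Long.parseLong(parts[0]), Long.parseLong(parts[1]) };
    }

    public static void main(String[] args) throws IOException {
        Path csv = Files.createTempDirectory("csvdims").resolve("data.csv");
        write(csv, Arrays.asList(new String[] {"a", "b", "c"}, new String[] {"1", "2", "3"}));
        long[] dims = readDimensions(csv);
        System.out.println("Rows: " + dims[0] + ", Columns: " + dims[1]);
    }
}
```

This only helps if you control the code that writes the files, and the sidecar goes stale as soon as something else edits the CSV.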

I found this short but interesting solution here: https://stackoverflow.com/a/5342096/4082824

LineNumberReader  lnr = new LineNumberReader(new FileReader(new File("File1")));
lnr.skip(Long.MAX_VALUE);
System.out.println(lnr.getLineNumber() + 1); //Add 1 because line index starts at 0
lnr.close();

My solution simply and correctly processes CSV files with multi-line cells or quoted values.

For example, given this CSV file:

1,"""2""","""111,222""","""234;222""","""""","1
2
3"
2,"""2""","""111,222""","""234;222""","""""","2
3"
3,"""5""","""1112""","""10;2""","""""","1
2"

And my solution snippet is:

import java.io.*;

public class CsvDimension {

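    public void parse(Reader reader) throws IOException {
        long cells = 0;
        long lines = 0;
        int c;
        int last = '\n';        // last character read; '\n' means we are at the start of a row
        boolean quoted = false; // true while inside a quoted field
        while ((c = reader.read()) != -1) {
            if (c == '"') {
                // An escaped quote ("") toggles twice, so the state ends up unchanged
                quoted = !quoted;
            }
            if (!quoted) {
                if (c == '\n') {
                    // An unquoted newline ends both the row and its last cell
                    lines++;
                    cells++;
                }
                if (c == ',') {
                    cells++;
                }
            }
            last = c;
        }
        if (last != '\n') {
            // Count a final row that has no trailing newline
            lines++;
            cells++;
        }
        // Guard against division by zero on an empty file
        long cols = lines == 0 ? 0 : cells / lines;
        System.out.printf("lines : %d\n cells %d\n cols: %d\n", lines, cells, cols);
        reader.close();
    }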

    public static void main(String args[]) throws IOException {
        new CsvDimension().parse(new BufferedReader(new FileReader(new File("test.csv"))));
    }
}
