
Java: read CSV + specific sums of subarrays - most efficient way

I need to read ints from a large CSV file and then compute specific sums over them. My current algorithm is:

String csvFile = "D:/input.csv";
String line = "";
String cvsSplitBy = ";";
Vector<int[]> converted = new Vector<int[]>();

try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {

   while ((line = br.readLine()) != null) {
       String[] a = line.split(";",-1);
       int[] b = new int[a.length]; 
       for (int n = 0; n < a.length; n++) {
          b[n] = Integer.parseInt(a[n]);
       }
       converted.add(b);
   }
} 

catch (IOException e) {
e.printStackTrace();
}

int x = 7;
int y = 5;

for (int m = 0; m < converted.size(); m++) {
  int[] row = converted.get(m);

  // Sum of the first x values (columns 0 .. x-1).
  int sum = 0;
  for (int n = 0; n < x; n++) {
      sum = sum + row[n];
  }
  System.out.print(sum + " ");

  // Sum of each following window of x values, shifted by y columns:
  // columns y .. y+x-1, then 2y .. 2y+x-1, and so on.
  for (int n = x + y; n <= row.length; n = n + y) {
      sum = 0;
      for (int o = n - x; o < n; o++) {
         sum = sum + row[o];
      }
      System.out.print(sum + " ");
  }
  System.out.println("");
}

What I am trying to do is get the sum of the first x members of each CSV row, and then the sum of x members at every further offset of y. In this case that is the sum of the first x = 7 columns (0-6), then the sum of the next 7 columns starting y = 5 columns later (5-11), then (10-16), and so on, printed for every row.

In the end I want to collect, for each window, the number of the line with the greatest sum. The final result would be, for example, 5,9,13,155..., which would mean that line 5 had the greatest sum of columns 0-6, line 9 the greatest sum of columns 5-11, and so on.

As you can see, this is quite inefficient. First I read the whole CSV into String[], then convert it to int[] and save it in a Vector. Then I run a rather inefficient loop to do the work. I need this to run as fast as possible, because I will be using very large CSV files with many different values of x and y. What I was thinking about, but don't know how to do, is:

  1. do these sums in the reading loop
  2. do the sum differently, not always looping x members backward (either saving last sum and then subtract old and add new members, or other faster way to do subarray sum)
  3. use IntStream and parallelism (parallel might be tricky, as in the end I am looking for a max)
  4. use a different input format than CSV?
  5. all of the above?

How can I do this as fast as possible? Thank you

Since the sums are computed per line, you do not need to read the whole file into memory first.

Path csvFile = Paths.get("D:/input.csv");
try (BufferedReader br = Files.newBufferedReader(csvFile, StandardCharsets.ISO_8859_1)) {

     String line;
     while ((line = br.readLine()) != null) {
         int[] b = lineToInts(line);
         int n = b.length; 

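         // Sum while reading: first the leading 7 values.
         // Assumes every line has at least 7 columns.
         int sum = 0;
         for (int i = 0; i < 7; ++i) {
             sum += b[i];
         }
         System.out.print(sum + " ");

         // Then the trailing 5 values of the line.
         sum = 0;
         for (int i = n - 5; i < n; ++i) {
             sum += b[i];
         }
         System.out.print(sum + " ");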

         System.out.println();
     }
}

private static int[] lineToInts(String line) {
     // Using split is slow, one could optimize the implementation.
     String[] a = line.split(";", -1);
     int[] b = new int[a.length]; 
     for (int n = 0; n < a.length; n++) {
         b[n] = Integer.parseInt(a[n]);
     }
     return b;
}

A faster version:

private static int[] lineToInts(String line) {
    int semicolons = 0;
    for (int i = 0; (i = line.indexOf(';', i)) != -1; ++i) {
        ++semicolons;
    }
    int[] b = new int[semicolons + 1];
    int pos = 0;
    for (int i = 0; i < b.length; ++i) {
        int pos2 = line.indexOf(';', pos);
        if (pos2 < 0) {
            pos2 = line.length();
        }
        b[i] = Integer.parseInt(line.substring(pos, pos2));
        pos = pos2 + 1;
    }
    return b;
}
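
If profiling shows that parsing still dominates, one further step is to avoid creating substrings at all and accumulate the digits directly. The following is only a sketch under stated assumptions, not a drop-in replacement for Integer.parseInt: the class name FastIntLine and method parse are hypothetical, fields may contain only an optional '-' and ASCII digits, an empty field is read as 0, and overflow is not checked.

```java
import java.util.Arrays;

class FastIntLine {
    /** Parses a ';'-separated line of ints without creating substrings. */
    static int[] parse(String line) {
        // First pass: count the separators to size the result array.
        int count = 1;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == ';') count++;
        }
        int[] out = new int[count];
        int idx = 0;
        int value = 0;
        boolean negative = false;
        // Second pass: accumulate digits until the next separator.
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == ';') {
                out[idx++] = negative ? -value : value;
                value = 0;
                negative = false;
            } else if (c == '-') {
                negative = true;
            } else if (c >= '0' && c <= '9') {
                value = value * 10 + (c - '0'); // assumption: no overflow check
            } else {
                throw new NumberFormatException("Unexpected character '" + c + "' at index " + i);
            }
        }
        out[idx] = negative ? -value : value; // last field has no trailing ';'
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("12;-3;0;45"))); // prints [12, -3, 0, 45]
    }
}
```

Whether this is actually faster than the indexOf/substring version depends on the JVM and the data, so it is worth measuring before adopting it.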

As an aside: Vector is an old, synchronized class; it is better to use List with ArrayList.

List<int[]> converted = new ArrayList<>(10_000);

Here the optional initial-capacity argument is given: ten thousand.

The weird try-with-resources syntax try (BufferedReader br = ...) { ensures that br is always closed automatically, even on an exception or a return.
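
To see that closing behaviour in isolation, here is a small sketch with a made-up resource. The names TryWithResourcesDemo, Tracked and closedAfterException are hypothetical and exist only for this illustration; any class implementing AutoCloseable behaves the same way.

```java
class TryWithResourcesDemo {
    /** Hypothetical resource that records whether close() was called. */
    static class Tracked implements AutoCloseable {
        boolean closed = false;

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Returns true if the resource was closed even though the body threw. */
    static boolean closedAfterException() {
        Tracked tracked = new Tracked();
        try (Tracked resource = tracked) {
            throw new IllegalStateException("simulated failure");
        } catch (IllegalStateException e) {
            // close() has already run by the time this catch block executes.
        }
        return tracked.closed;
    }

    public static void main(String[] args) {
        System.out.println(closedAfterException()); // prints true
    }
}
```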


Parallelism (added after the question was reformatted)

You could read all lines

List<String> lines = Files.readAllLines(csvFile, StandardCharsets.ISO_8859_1);

And then play with parallel streams like:

OptionalInt max = lines.parallelStream()
    .mapToInt(line -> {
        int[] b = lineToInts(line);
        ...
        return sum;
    }).max();

or:

IntStream.range(0, lines.size()).parallel()
    .mapToObj(i -> {
        String line = lines.get(i);
        ...
        return new int[] { i, sum5, sum7 };
    }); 
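
One way the parallel idea could be completed for the windows described in the question (x values wide, shifted by y) is sketched below. It is an illustration rather than the definitive approach: the class ParallelWindowMax and its methods are hypothetical names, line indexes are 0-based, y is assumed to be positive, and on a tie the earlier line wins. Only the per-line parsing and summing run in parallel; the final search for the maximum is a cheap sequential pass, which sidesteps the worry about combining a max across threads.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.IntStream;

class ParallelWindowMax {
    /** Sums of the windows [k*y, k*y + x) that fit completely into the line. */
    static long[] windowSums(String line, int x, int y) {
        int[] row = Arrays.stream(line.split(";", -1))
                .mapToInt(s -> Integer.parseInt(s.trim()))
                .toArray();
        int count = row.length < x ? 0 : (row.length - x) / y + 1;
        long[] sums = new long[count];
        for (int w = 0; w < count; w++) {
            for (int i = w * y; i < w * y + x; i++) {
                sums[w] += row[i];
            }
        }
        return sums;
    }

    /** Per window, the 0-based index of the line with the greatest sum. */
    static int[] bestLineIndexes(List<String> lines, int x, int y) {
        // Parallel part: parse and sum each line; toArray keeps line order.
        long[][] sums = IntStream.range(0, lines.size()).parallel()
                .mapToObj(i -> windowSums(lines.get(i), x, y))
                .toArray(long[][]::new);
        int windows = Arrays.stream(sums).mapToInt(s -> s.length).max().orElse(0);
        int[] best = new int[windows];
        Arrays.fill(best, -1);
        // Sequential part: pick the best line for each window.
        for (int i = 0; i < sums.length; i++) {
            for (int w = 0; w < sums[i].length; w++) {
                if (best[w] < 0 || sums[i][w] > sums[best[w]][w]) {
                    best[w] = i;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("1;2;3;4", "5;0;0;9", "0;6;1;1");
        System.out.println(Arrays.toString(bestLineIndexes(lines, 2, 1))); // prints [2, 2, 1]
    }
}
```

This keeps all per-line sums in memory, so for very large files the sequential, streaming variant may be the better fit.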

You could probably compute some of your sums while reading the input. It might also be feasible to use HashMaps of type Integer,Integer.
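
A rough sketch of computing the sums while reading, under some assumptions: the window sums are derived from a prefix-sum array (a technique swapped in here, so each window costs one subtraction regardless of x), line numbers are 1-based, y is positive, ties go to the earlier line, and the names WindowMaxTracker, windowSums and bestLines are hypothetical. Plain lists are used instead of a HashMap because the window index is already a dense 0..k range.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class WindowMaxTracker {
    /** Sums of the windows [0, x), [y, y + x), [2y, 2y + x), ... that fit in the row. */
    static long[] windowSums(int[] row, int x, int y) {
        if (row.length < x) {
            return new long[0];
        }
        // prefix[i] holds row[0] + ... + row[i - 1].
        long[] prefix = new long[row.length + 1];
        for (int i = 0; i < row.length; i++) {
            prefix[i + 1] = prefix[i] + row[i];
        }
        int count = (row.length - x) / y + 1;
        long[] sums = new long[count];
        for (int w = 0; w < count; w++) {
            int start = w * y;
            sums[w] = prefix[start + x] - prefix[start];
        }
        return sums;
    }

    /** Per window, the 1-based number of the line with the greatest sum. */
    static List<Integer> bestLines(BufferedReader reader, int x, int y) throws IOException {
        List<Long> bestSum = new ArrayList<>();
        List<Integer> bestLine = new ArrayList<>();
        String line;
        int lineNo = 0;
        while ((line = reader.readLine()) != null) {
            lineNo++;
            if (line.isEmpty()) {
                continue; // skip blank lines but keep counting them
            }
            String[] parts = line.split(";", -1);
            int[] row = new int[parts.length];
            for (int i = 0; i < parts.length; i++) {
                row[i] = Integer.parseInt(parts[i].trim());
            }
            long[] sums = windowSums(row, x, y);
            for (int w = 0; w < sums.length; w++) {
                if (w == bestSum.size()) {          // first line that has this window
                    bestSum.add(sums[w]);
                    bestLine.add(lineNo);
                } else if (sums[w] > bestSum.get(w)) {
                    bestSum.set(w, sums[w]);
                    bestLine.set(w, lineNo);
                }
            }
        }
        return bestLine;
    }

    public static void main(String[] args) throws IOException {
        String csv = "1;2;3;4\n5;0;0;9\n0;6;1;1\n";
        BufferedReader reader = new BufferedReader(new StringReader(csv));
        System.out.println(bestLines(reader, 2, 1)); // prints [3, 3, 2]
    }
}
```

For the real file, the StringReader would be replaced by a reader on the CSV path. If many (x, y) pairs are needed, the prefix array of a row can be reused for all of them, so each line is parsed only once.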
