
How to randomly read a full row (including possible line breaks) from a big CSV file in Java

I have a big CSV file of unknown size, possibly more than 4 GB. I need to read some rows from it at random to use as test cases in an application.

Reading the whole file into memory is not an option because it raises an OutOfMemoryError.

One solution is to generate a list of random line numbers within the total line count, sort it, and then read the file line by line, keeping the lines whose numbers appear in the list. That way I could get a random set of full rows from the CSV file.

Is there a library or method for reading full rows at random from a big CSV file?

One solution:

// generate random numbers
List<Integer> indexList = new ArrayList<>();
for (int i = 0; i < testCount; i++) {
    int random = faker.numberBetween(0, total);
    indexList.add(random);
}

// sort
Collections.sort(indexList);

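// read the selected lines from the file (needs java.nio.charset.StandardCharsets)
// note: readLine() splits on physical lines, so a quoted field that
// contains a line break is still split across several lines here
List<String> list = new ArrayList<>();
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("test.csv"), StandardCharsets.UTF_8))) {

    String line;
    int lineNum = 0;
    int pos = 0;

    // stop at end of file or as soon as every requested line has been collected
    while (pos < testCount && (line = reader.readLine()) != null) {

        // the inner loop handles duplicate numbers in the sorted list
        while (pos < testCount && indexList.get(pos) == lineNum) {
            list.add(line);
            pos++;
        }

        lineNum++;
    }
} // try-with-resources closes the reader even if an exception is thrown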

Reservoir sampling is an algorithm that comes to mind here. The nice thing about it is that you don't need to know in advance how many rows there are, and you don't have to read the whole file into memory: you read one row at a time until the end of the file and only keep the current row and the sample.
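
As a rough sketch of that idea (not a definitive implementation), the following combines reservoir sampling with a small record reader that keeps reading physical lines while a quoted field is still open, so a row containing line breaks comes back as one record. The class name CsvReservoirSampler and the helpers readRecord, endsInsideQuotes and sample are made up for this illustration. It assumes RFC 4180-style quoting (double quotes, with "" as the escaped quote) and treats a header row, if any, like any other record. A dedicated CSV parser library is likely the safer choice for real data.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper class for illustration only.
class CsvReservoirSampler {

    // Toggles the "inside quotes" state for every double quote on the line.
    // An escaped quote ("") toggles twice, so the state is unchanged by it.
    static boolean endsInsideQuotes(String line, boolean inQuotes) {
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == '"') {
                inQuotes = !inQuotes;
            }
        }
        return inQuotes;
    }

    // Reads one CSV record, which may span several physical lines.
    // Returns null at end of input. Line breaks inside a record are
    // rejoined with '\n' because readLine() strips the terminator.
    static String readRecord(BufferedReader reader) throws IOException {
        String line = reader.readLine();
        if (line == null) {
            return null;
        }
        StringBuilder record = new StringBuilder(line);
        boolean inQuotes = endsInsideQuotes(line, false);
        while (inQuotes) {
            String next = reader.readLine();
            if (next == null) {
                break; // unterminated quote: return what we have
            }
            record.append('\n').append(next);
            inQuotes = endsInsideQuotes(next, inQuotes);
        }
        return record.toString();
    }

    // Reservoir sampling: keeps at most k records in memory and
    // makes a single pass over the input.
    static List<String> sample(Reader source, int k, Random rng) throws IOException {
        List<String> reservoir = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            long seen = 0;
            String record;
            while ((record = readRecord(reader)) != null) {
                seen++;
                if (reservoir.size() < k) {
                    reservoir.add(record); // fill the reservoir first
                } else {
                    // pick a slot in [0, seen); replace only if it falls in [0, k)
                    long j = (long) (rng.nextDouble() * seen);
                    if (j < k) {
                        reservoir.set((int) j, record);
                    }
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) throws IOException {
        String csv = "id,comment\n1,\"first line\nsecond line\"\n2,plain\n3,\"say \"\"hi\"\"\"\n";
        List<String> rows = sample(new StringReader(csv), 2, new Random());
        for (String row : rows) {
            System.out.println("[" + row + "]");
        }
    }
}
```

For the real file, the StringReader could be replaced by an InputStreamReader over a FileInputStream, as in the question's code. Unlike the sorted-index approach, this does not need the total row count up front, but it always reads the file to the end.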
