
C#: Fastest way to read specific columns in CSV files

I have a very large CSV file (millions of records).
I have developed a smart search algorithm to locate specific line ranges in the file to avoid parsing the whole file.

Now I am facing a trickier issue: I am only interested in the content of a specific column.
Is there a smart way to avoid looping line by line through a 200MB file and to retrieve only the content of that specific column?

You mean get every value from every row for a specific column?

You're probably going to have to visit every row to do that.

This C# CSV reading library is very quick, so you might be able to use it:

LumenWorks.Framework.IO.Csv by Sébastien Lorion
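A minimal sketch of how a single column might be pulled out with that reader, assuming the library reference is in place and the file has a header row (the file name and column index are placeholders for your own data):

using System;
using System.IO;
using LumenWorks.Framework.IO.Csv;

class Program
{
    static void Main()
    {
        // "data.csv" and colIndex are placeholders for your own file and column.
        int colIndex = 2;
        using (var csv = new CsvReader(new StreamReader("data.csv"), true)) // true = file has a header row
        {
            while (csv.ReadNextRecord())       // still visits every row, but parses it quickly
            {
                string value = csv[colIndex];  // only pull out the field you need
                Console.WriteLine(value);
            }
        }
    }
}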

I'd use an existing library, as codeulike has suggested; for a very good reason why, read this article:

Stop Rolling Your Own CSV Parser!

Unless all CSV fields have a fixed width (meaning that even an empty field still occupies its n bytes of blank space between the surrounding separators), no.

If yes

Then each row, in turn, also has a fixed length, and therefore you can skip straight to the first value for that column and, once you've read it, immediately advance to the next row's value for the same field, without having to read any intermediate values.

I think this is pretty simple - but I'm on a roll at the moment (and at lunch), so I'm going to finish it anyway :)

To do this, we first want to know how long each row is in characters (adjust for bytes according to Unicode, UTF8 etc):

row_len = sum(widths[0..n-1]) + n-1 + row_sep_length

Where n is the total number of columns on each row - this is a constant for the whole file. We add an extra n-1 to it to account for the separators between column values.

And row_sep_length is the length of the separator between two rows - usually a newline, or potentially a [carriage-return & line-feed] pair.

The value for a column row[r]col[i] will be offset characters from the start of row[r], where offset is defined as:

offset = i > 0 ? sum(widths[0..i-1]) + i : 0;
//i.e. the sum of the widths of all columns before col[i],
//plus one character for each separator between adjacent columns

And then, assuming you've read the whole column value, up to the next separator, the offset to the starting character of the next column value row[r+1]col[i] is calculated by subtracting the width of your column from the row length. This is yet another constant for the file:

next-field-offset = row_len - widths[i];
//widths[i] is the width of the field you are actually reading.

Throughout, i is zero-based in this pseudocode, as is the indexing of the vectors/arrays.

To read, then, you first advance the file pointer by offset characters - taking you to the first value you want. You read the value (taking you to the next separator) and then simply advance the file pointer by next-field-offset characters. If you reach EOF at this point, you're done.

I might have missed a character either way in this - so if it's applicable - do check it!

This only works if you can guarantee that all field values - even nulls - for all rows will be the same length, that the column separators are always the same length, and that all row separators are the same length. If not, this approach won't work.
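Putting those formulas together, here is a rough C# sketch of the seek-based read. It assumes a single-byte encoding (e.g. ASCII) so that characters map 1:1 to bytes; widths, colIndex and rowSepLength are inputs you would supply for your own file.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

static class FixedWidthColumnReader
{
    // Reads one column from a strictly fixed-width "CSV".
    // widths: the fixed width of every column; colIndex: zero-based target column;
    // rowSepLength: 1 for '\n', 2 for "\r\n".
    public static IEnumerable<string> ReadColumn(
        string path, int[] widths, int colIndex, int rowSepLength)
    {
        int n = widths.Length;

        // row_len = sum(widths[0..n-1]) + (n-1) column separators + row separator
        long rowLen = widths.Sum() + (n - 1) + rowSepLength;

        // offset = sum of widths of all columns before col[i],
        // plus one separator character for each of them
        long offset = 0;
        for (int i = 0; i < colIndex; i++)
            offset += widths[i] + 1;

        // next-field-offset = row_len - widths[i]
        long nextFieldOffset = rowLen - widths[colIndex];

        var buffer = new byte[widths[colIndex]];
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(offset, SeekOrigin.Begin);                 // jump to the first value
            while (fs.Read(buffer, 0, buffer.Length) == buffer.Length)
            {
                yield return Encoding.ASCII.GetString(buffer);
                fs.Seek(nextFieldOffset, SeekOrigin.Current);  // straight to the next row's value
            }
        }
    }
}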

If not

You'll have to do it the slow way - find the column in each line and do whatever it is you need to do.

If you're doing a significant amount of work on each column value, one optimisation would be to pull all the column values out into a list first (created with a known initial capacity), batching at, say, 100,000 at a time, and then iterate through those.

If you keep each loop focused on a single task, that should be more efficient than one big loop.

Equally, once you've batched 100,000 column values you could use Parallel LINQ to distribute the second loop (not the first, since there's no point parallelising reading from a file).
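A rough sketch of that batch-then-parallelise shape; GetColumnValue and ProcessValue here are placeholder names standing in for your own field extraction and per-value work:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ColumnBatcher
{
    // Placeholder per-value work; replace with whatever you actually need to do.
    static int ProcessValue(string value) => value.Length;

    // Naive field extraction: only valid when no field contains a quoted comma.
    static string GetColumnValue(string line, int colIndex) =>
        line.Split(',')[colIndex];

    static void Main()
    {
        const int batchSize = 100_000;
        const int colIndex = 2;                    // assumed target column
        var batch = new List<string>(batchSize);   // known initial capacity

        using (var reader = new StreamReader("data.csv"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)   // first loop: I/O only
            {
                batch.Add(GetColumnValue(line, colIndex));
                if (batch.Count == batchSize)
                {
                    // second loop: CPU-bound work, distributed with PLINQ
                    var results = batch.AsParallel().Select(ProcessValue).ToList();
                    batch.Clear();
                }
            }
            if (batch.Count > 0)
                batch.AsParallel().Select(ProcessValue).ToList();
        }
    }
}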

There are only shortcuts if you can pose specific limitations on the data.

For example, you can only read the file line by line if you know that there are no values in the file that contain line breaks. If you don't know this, you have to parse the file record by record as a stream, and each record ends where there is a line break that is not inside a value.

However, unless you know that each line takes up exactly the same number of bytes, there is no other way to read the file than line by line. A line break in a file is just another character (or pair of characters); there is no way to locate a line in a text file other than to read all the lines that come before it.

You can take similar shortcuts when reading a record if you can place limitations on the fields in the records. If, for example, you know that the fields to the left of the one you are interested in are all numerical, you can use a simpler parsing method to find the start of the field.
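For instance, a minimal sketch of that simpler parsing, assuming the line is well-formed and that neither the preceding fields nor the target field contain embedded commas or quotes; colIndex is the zero-based column you want:

// Skip over the separators of the (comma-free) fields before the target column.
static string GetColumnFast(string line, int colIndex)
{
    int start = 0;
    for (int i = 0; i < colIndex; i++)
        start = line.IndexOf(',', start) + 1;   // jump past each separator

    int end = line.IndexOf(',', start);         // end of the target field
    return end < 0 ? line.Substring(start) : line.Substring(start, end - start);
}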
