
Most efficient way to process a large CSV in .NET

Forgive my noobiness but I just need some guidance and I can't find another question that answers this. I have a fairly large csv file (~300k rows) and I need to determine for a given input, whether any line in the csv begins with that input. I have sorted the csv alphabetically, but I don't know:

1) how to process the rows in the csv- should I read it in as a list/collection, or use OLEDB, or an embedded database or something else?

2) how to find something efficiently from an alphabetical list (using the fact that it's sorted to speed things up, rather than searching the whole list)

You don't give enough specifics to give you a concrete answer but...


IF the CSV file changes often then use OLEDB and just change the SQL query based on your input.

string sql = @"SELECT * FROM [" + fileName + "] WHERE Column1 LIKE 'blah%'";
using(OleDbConnection connection = new OleDbConnection(
          @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + fileDirectoryPath + 
          ";Extended Properties=\"Text;HDR=" + hasHeaderRow + "\""))

IF the CSV file doesn't change often and you run a lot of "queries" against it, load it once into memory and quickly search it each time.

IF you want your search to be an exact match on a column use a Dictionary where the key is the column you want to match on and the value is the row data.

Dictionary<long, string> Rows = new Dictionary<long, string>();
...
if(Rows.ContainsKey(search)) ...

IF you want your search to be a partial match like StartsWith then have one array containing your searchable data (i.e. the first column) and another list or array containing your row data. Then use C#'s built-in binary search: http://msdn.microsoft.com/en-us/library/2cy9f6wb.aspx

string[] SortedSearchables = new string[0]; // populated from the CSV
List<string> SortedRows = new List<string>();
...
string result = null;
int foundIdx = Array.BinarySearch<string>(SortedSearchables, searchTerm);
if(foundIdx < 0) {
    foundIdx = ~foundIdx;
    if(foundIdx < SortedSearchables.Length && SortedSearchables[foundIdx].StartsWith(searchTerm)) {
        result = SortedRows[foundIdx];
    }
} else {
    result = SortedRows[foundIdx];
}

NOTE: this code was written inside the browser window and may contain syntax errors as it wasn't tested.

If you're only doing it once per program run, this seems pretty fast. (Updated to use StreamReader instead of FileStream based on comments below)

    static string FindRecordBinary(string search, string fileName)
    {
        using (StreamReader fs = new StreamReader(fileName))
        {
            long min = 0; // TODO: What about header row?
            long max = fs.BaseStream.Length;
            while (min <= max)
            {
                long mid = (min + max) / 2;
                fs.BaseStream.Position = mid;

                fs.DiscardBufferedData();
                if (mid != 0) fs.ReadLine(); // skip the (likely partial) line we landed in
                string line = fs.ReadLine();
                if (line == null) { min = mid+1; continue; }

                int compareResult;
                if (line.Length > search.Length)
                    compareResult = String.Compare(
                        line, 0, search, 0, search.Length, false );
                else
                    compareResult = String.Compare(line, search);

                if (0 == compareResult) return line;
                else if (compareResult > 0) max = mid-1;
                else min = mid+1;
            }
        }
        return null;
    }

This runs in 0.007 seconds for a 600,000-record test file that's 50 megs. In comparison a file scan averages over half a second depending on where the record is located (a 100-fold difference).

Obviously if you do it more than once, caching is going to speed things up. One simple way to do partial caching would be to keep the StreamReader open and re-use it, just reset min and max each time through. This would save you storing 50 megs in memory all the time.

EDIT: Added knaki02's suggested fix.
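A hedged sketch of that partial-caching idea: the same binary search as above, but the StreamReader stays open between lookups so each call only resets min and max. The class and method names here are illustrative, not from the original answer.

```csharp
using System;
using System.IO;

// Illustrative wrapper: same binary search as above, but the StreamReader
// is kept open so repeated lookups avoid re-opening the 50 MB file.
public class CsvSearcher : IDisposable
{
    private readonly StreamReader reader;

    public CsvSearcher(string fileName)
    {
        reader = new StreamReader(fileName);
    }

    public string FindRecord(string search)
    {
        long min = 0;                          // reset on every call
        long max = reader.BaseStream.Length;   // the stream itself is reused
        while (min <= max)
        {
            long mid = (min + max) / 2;
            reader.BaseStream.Position = mid;
            reader.DiscardBufferedData();
            if (mid != 0) reader.ReadLine();   // skip the partial line we landed in
            string line = reader.ReadLine();
            if (line == null) { min = mid + 1; continue; }

            int compareResult = line.Length > search.Length
                ? String.Compare(line, 0, search, 0, search.Length, false)
                : String.Compare(line, search);

            if (compareResult == 0) return line;
            if (compareResult > 0) max = mid - 1;
            else min = mid + 1;
        }
        return null;
    }

    public void Dispose()
    {
        reader.Dispose();
    }
}
```

Usage would be something like using (var searcher = new CsvSearcher("data.csv")) { var row = searcher.FindRecord("somePrefix"); } with as many FindRecord calls as needed inside the using block.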

If you can cache the data in memory, and you only need to search the list on one primary key column, I would recommend storing the data in memory as a Dictionary object. The Dictionary class stores the data as key/value pairs in a hash table. You could use the primary key column as the key in the dictionary, and then use the rest of the columns as the value in the dictionary. Looking up items by key in a hash table is typically very fast.

For instance, you could load the data into a dictionary, like this:

// TextFieldParser requires a reference to Microsoft.VisualBasic
// and "using Microsoft.VisualBasic.FileIO;"
Dictionary<string, string[]> data = new Dictionary<string, string[]>();
using (TextFieldParser parser = new TextFieldParser(@"C:\test.csv"))
{
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(",");
    while (!parser.EndOfData)
    {
        try
        {
            string[] fields = parser.ReadFields();
            data[fields[0]] = fields;
        }
        catch (MalformedLineException ex)
        {
            // ...
        }
    }
}

And then you could get the data for any item, like this:

string[] fields = data["key I'm looking for"];
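One caveat worth adding: the dictionary indexer throws a KeyNotFoundException when the key is absent, so a lookup that may miss is safer with TryGetValue. The sample keys and helper name below are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class SafeLookup
{
    // Returns the row for the key, or null when the key is not present,
    // instead of letting the indexer throw KeyNotFoundException.
    public static string[] FindRow(Dictionary<string, string[]> data, string key)
    {
        string[] fields;
        return data.TryGetValue(key, out fields) ? fields : null;
    }
}
```

A null return then signals "no such line" without any exception handling at the call site.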

Given that the CSV is sorted, if you can load the entire thing into memory (and the only processing you need to do is a .StartsWith() on each line) then you can use a binary search to get exceptionally fast searching.

Maybe something like this (NOT TESTED!):

var csv = File.ReadAllLines(@"c:\file.csv").ToList();
int index = csv.BinarySearch("StringToFind", new StartsWithComparer());
bool exists = index >= 0;

...

public class StartsWithComparer: IComparer<string>
{
    public int Compare(string x, string y)
    {
        if(x.StartsWith(y))
            return 0;
        else
            return x.CompareTo(y);
    }
}

If your file is in memory (for example because you did the sorting) and you keep it as an array of strings (lines) then you can use a simple bisection search method. You can start with the code from this question on CodeReview, just change the comparer to work with strings instead of ints and to check only the beginning of each line.
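A minimal sketch of that string bisection, assuming the lines are sorted ordinally (the class and method names are illustrative): comparing only the first prefix.Length characters means a hit is exactly a line that begins with the search input.

```csharp
using System;

public static class PrefixBisection
{
    // Returns the index of a line starting with the prefix, or -1 if none.
    public static int FindLineStartingWith(string[] sortedLines, string prefix)
    {
        int lo = 0, hi = sortedLines.Length - 1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            // Compare only the beginning of the line against the prefix.
            int cmp = String.Compare(sortedLines[mid], 0, prefix, 0,
                                     prefix.Length, StringComparison.Ordinal);
            if (cmp == 0) return mid;   // line begins with the prefix
            if (cmp < 0) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1;
    }
}
```

Note that when several lines share the prefix this returns *a* matching index, not necessarily the first one, which is enough for the OP's "does any line begin with this input?" check.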

If you have to re-read the file each time, because it may be changed or it's saved/sorted by another program, then the simplest algorithm is the best one:

using (var stream = File.OpenText(path))
{
    string line;
    while ((line = stream.ReadLine()) != null)
    {
        // Replace this with your comparison / CSV splitting
        if (line.StartsWith("..."))
        {
            // The file contains the line with the required input
        }
    }
}

Of course you may read the entire file into memory (to use LINQ or List<T>.BinarySearch()) each time, but this is far from optimal (you'll read everything even if you may need to examine just a few lines) and the file itself could even be too large.

If you really need something more, and you do not have your file in memory because of the sorting (but you should profile your actual performance against your requirements), you have to implement a better search algorithm, for example the Boyer-Moore algorithm.

The OP stated he really just needs to search based on the line.

The question is then whether to hold the lines in memory or not.

If a line is 1 KB then that's 300 MB of memory.
If a line is 1 MB then that's 300 GB of memory.

StreamReader.ReadLine will have a low memory profile.
Since the file is sorted, you can stop looking once the current line is greater than the search term.
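That early-exit scan could be sketched like this, assuming ordinal sorting and ordinal comparison (the class and method names are illustrative):

```csharp
using System;
using System.IO;

public static class SortedScan
{
    // Streams the file line by line; because the file is sorted, the scan
    // stops as soon as the current line sorts after the search prefix.
    public static string FindByPrefix(string path, string prefix)
    {
        using (var reader = File.OpenText(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.StartsWith(prefix, StringComparison.Ordinal))
                    return line;      // found a matching line
                if (String.Compare(line, prefix, StringComparison.Ordinal) > 0)
                    return null;      // passed the prefix - stop early
            }
        }
        return null;
    }
}
```

The StartsWith check has to come before the "greater than" check, since a line that begins with the prefix still sorts after it.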

If you hold it in memory then a simple

List<String> 

with LINQ will work.
LINQ is not smart enough to take advantage of the sort, but against 300K rows it would still be pretty fast.

BinarySearch will take advantage of the sort.

I wrote this quickly for work, could be improved on...

Define the column numbers:

private enum CsvCols
{
    PupilReference = 0,
    PupilName = 1,
    PupilSurname = 2,
    PupilHouse = 3,
    PupilYear = 4,
}

Define the Model

public class ImportModel
{
    public string PupilReference { get; set; }
    public string PupilName { get; set; }
    public string PupilSurname { get; set; }
    public string PupilHouse { get; set; }
    public string PupilYear { get; set; }
}

Import and populate a list of models:

var rows = File.ReadLines(csvfilePath).Select(p => p.Split(',')).Skip(1).ToArray();

var pupils = rows.Select(x => new ImportModel
{
    PupilReference = x[(int) CsvCols.PupilReference],
    PupilName = x[(int) CsvCols.PupilName],
    PupilSurname = x[(int) CsvCols.PupilSurname],
    PupilHouse = x[(int) CsvCols.PupilHouse],
    PupilYear = x[(int) CsvCols.PupilYear],
}).ToList();

This returns a list of strongly typed objects.

Try the free CSV Reader. No need to reinvent the wheel over and over again ;)

1) If you do not need to store the results, just iterate through the CSV - handle each line and forget it. If you need to process all lines again and again, store them in a List or Dictionary (with a good key, of course).

2) Try generic extension methods like this:

var list = new List<string>() { "a", "b", "c" };
string oneA = list.FirstOrDefault(entry => !string.IsNullOrEmpty(entry) && entry.ToLowerInvariant().StartsWith("a"));
IEnumerable<string> allAs = list.Where(entry => !string.IsNullOrEmpty(entry) && entry.ToLowerInvariant().StartsWith("a"));

Here is my VB.net code. It is for a quote-qualified CSV, so for a regular CSV, change Let n = P.Split(New Char() {""","""}) to Let n = P.Split(New Char() {","})

Dim path as String = "C:\linqpad\Patient.txt"
Dim pat = System.IO.File.ReadAllLines(path)
Dim Patz = From P in pat _
    Let n = P.Split(New Char() {""","""}) _
    Order by n(5) _
    Select New With {
        .Doc =n(1), _
        .Loc = n(3), _
        .Chart = n(5), _
        .PatientID= n(31), _
        .Title = n(13), _
        .FirstName = n(9), _
        .MiddleName = n(11), _
        .LastName = n(7), _
        .StatusID = n(41) _
        }
Patz.Dump()

Normally I would recommend finding a dedicated CSV parser (like this or this). However, I noticed this line in your question:

I need to determine for a given input, whether any line in the csv begins with that input.

That tells me that computer time spent parsing CSV data before this is determined is time wasted. You just need code to match text against text, and you can do that via a string comparison as easily as anything else.

Additionally, you mention that the data is sorted. This should allow you to speed things up tremendously... but you need to be aware that to take advantage of it you will need to write your own code to make seek calls on low-level file streams. This will be by far your best-performing option, but it will also by far require the most initial work and maintenance.

I recommend an engineering based approach, where you set a performance goal, build something relatively simple, and measure the results against that goal. In particular, start with the 2nd link I posted above. The CSV reader there will only load one record into memory at a time, so it should perform reasonably well, and it's easy to get started with. Build something that uses that reader, and measure the results. If they meet your goal, then stop there.

If they don't meet your goal, adapt the code from the link so that as you read each line you first do a string comparison (before bothering to parse the csv data), and only do the work to parse csv for the lines that match. This should perform better, but only do the work if the first option does not meet your goals. When this is ready, measure the performance again.

Finally, if you still don't meet the performance goal, we're into the territory of writing low-level code to do a binary search on your file stream using seek calls. This is likely the best you'll be able to do, performance-wise, but it will be very messy and bug-prone code to write, and so you only want to go here if you absolutely do not meet your goals from earlier steps.

Remember, performance is a feature, and just like any other feature you need to evaluate how you build for that feature relative to real design goals. "As fast as possible" is not a reasonable design goal. Something like "respond to a user search within .25 seconds" is a real design goal, and if the simpler but slower code still meets that goal, you need to stop there.
