简体   繁体   中英

Most efficient way to search enumerable

I am writing a small program that takes in a .csv file as input with about 45k rows. I am trying to compare the contents of this file with the contents of a table on a database (SQL Server through dynamics CRM using Xrm.Sdk if it makes a difference).

In my current program (which takes about 25 minutes to compare - the file and database are the exact same here both 45k rows with no differences), I have all existing records on the database in a DataCollection<Entity> which inherits Collection<T> and IEnumerable<T>

In my code below I am filtering using the Where method and then doing a logic based the count of matches. The Where seems to be the bottleneck here. Is there a more efficient approach than this? I am by no means a LINQ expert.

foreach (var record in inputDataLines)
{
    var fields = record.Split(',');

    var fund = fields[0];
    var bps = Convert.ToDecimal(fields[1]);
    var withdrawalPct = Convert.ToDecimal(fields[2]);
    var percentile = Convert.ToInt32(fields[3]);
    var age = Convert.ToInt32(fields[4]);
    var bombOutTerm = Convert.ToDecimal(fields[5]);

    var matchingRows = existingRecords.Entities.Where(r => r["field_1"].ToString() == fund
                                      && Convert.ToDecimal(r["field_2"]) == bps
                                      && Convert.ToDecimal(r["field_3"]) == withdrawalPct
                                      && Convert.ToDecimal(r["field_4"]) == percentile
                                      && Convert.ToDecimal(r["field_5"]) == age);

    entitiesFound.AddRange(matchingRows);

    if (matchingRows.Count() == 0)
    {
        rowsToAdd.Add(record);
    }
    else if (matchingRows.Count() == 1)
    {
        if (Convert.ToDecimal(matchingRows.First()["field_6"]) != bombOutTerm)
        {
            rowsToUpdate.Add(record);
            entitiesToUpdate.Add(matchingRows.First());
        }
    }
    else
    {
        entitiesToDelete.AddRange(matchingRows);
        rowsToAdd.Add(record);
    }
}

EDIT: I can confirm that all existingRecords are in memory before this code is executed. There is no IO or DB access in the above loop.

Himbrombeere is right, you should execute the query first and put the result into a collection before you use Any , Count , AddRange or whatever method will execute the query again. In your code it's possible that the query is executed 5 times in every loop iteration.

Watch out for the term deferred execution in the documentation. If a method is implemented in that way, then it means that this method can be used to construct a LINQ query(so you can chain it with other methods and at the end you have a query). But only methods that don't use deferred execution like Count , Any , ToList (or a plain foreach ) will actually execute it. If you dont want that the whole query is executed everytime and you have to access this query multiple times , it's better to store the result in a collection(.fe with ToList ).

However, you could use a different approach which should be much more efficient, a Lookup<TKey, TValue> which is similar to a dictionary and can be used with an anonymous type as key:

var lookup = existingRecords.Entities.ToLookup(r => new 
{
    fund = r["field_1"].ToString(),
    bps = Convert.ToDecimal(r["field_2"]),
    withdrawalPct =  Convert.ToDecimal(r["field_3"]),
    percentile = Convert.ToDecimal(r["field_4"]),
    age = Convert.ToDecimal(r["field_5"])
});

Now you can access this lookup in the loop very efficiently.

foreach (var record in inputDataLines)
{
    var fields = record.Split(',');
    var fund = fields[0];
    var bps = Convert.ToDecimal(fields[1]);
    var withdrawalPct = Convert.ToDecimal(fields[2]);
    var percentile = Convert.ToInt32(fields[3]);
    var age = Convert.ToInt32(fields[4]);
    var bombOutTerm = Convert.ToDecimal(fields[5]);

    var matchingRows = lookup[new {fund, bps, withdrawalPct, percentile, age}].ToList();

    entitiesFound.AddRange(matchingRows);

    if (matchingRows.Count() == 0)
    {
        rowsToAdd.Add(record);
    }
    else if (matchingRows.Count() == 1)
    {
        if (Convert.ToDecimal(matchingRows.First()["field_6"]) != bombOutTerm)
        {
            rowsToUpdate.Add(record);
            entitiesToUpdate.Add(matchingRows.First());
        }
    }
    else
    {
        entitiesToDelete.AddRange(matchingRows);
        rowsToAdd.Add(record);
    }
}

Note that this will work even if the key does not exist(an empty list is returned).

Add a ToList after your Convert.ToDecimal(r["field_5"]) == age); -line to force an immediate execution of the query.

var matchingRows = existingRecords.Entities.Where(r => r["field_1"].ToString() == fund
                                  && Convert.ToDecimal(r["field_2"]) == bps
                                  && Convert.ToDecimal(r["field_3"]) == withdrawalPct
                                  && Convert.ToDecimal(r["field_4"]) == percentile
                                  && Convert.ToDecimal(r["field_5"]) == age)
                    .ToList();

The Where doesn´t actually execute your query, it just prepares it. The actual execution happens later in a delayed way. In your case that happens when calling Count which itself will iterate the entire collection of items. But if the first condition fails, the second one is checked leading to a second iteration of the complete collection when calling Count . In this case you actually execute that query a thrird time when calling matchingRows.First() .

When forcing an immediate execution you´re executing the query only once and thus iterating the entire collection only once also which will decrease your overall-time.

Another option, which is basically along the same lines as the other answers, is to prepare your data first, so that you're not repeatedly calling things like r["field_2"] (which are relatively slow to look up).

This is a (1) clean your data, (2) query/join your data, (3) process your data approach.

Do this:

(1)

var inputs =
    inputDataLines
        .Select(record =>
        {
            var fields = record.Split(',');
            return new
            {
                fund = fields[0],
                bps = Convert.ToDecimal(fields[1]),
                withdrawalPct = Convert.ToDecimal(fields[2]),
                percentile = Convert.ToInt32(fields[3]),
                age = Convert.ToInt32(fields[4]),
                bombOutTerm = Convert.ToDecimal(fields[5]),
                record
            };
        })
        .ToArray();

var entities =
    existingRecords
        .Entities
        .Select(entity => new
        {
            fund = entity["field_1"].ToString(),
            bps = Convert.ToDecimal(entity["field_2"]),
            withdrawalPct = Convert.ToDecimal(entity["field_3"]),
            percentile = Convert.ToInt32(entity["field_4"]),
            age = Convert.ToInt32(entity["field_5"]),
            bombOutTerm = Convert.ToDecimal(entity["field_6"]),
            entity
        })
        .ToArray()
        .GroupBy(x => new
        {
            x.fund,
            x.bps,
            x.withdrawalPct,
            x.percentile,
            x.age
        }, x => new
        {
            x.bombOutTerm,
            x.entity,
        });

(2)

var query =
    from i in inputs
    join e in entities on new { i.fund, i.bps, i.withdrawalPct, i.percentile, i.age } equals e.Key
    select new { input = i, matchingRows = e };

(3)

foreach (var x in query)
{
    entitiesFound.AddRange(x.matchingRows.Select(y => y.entity));

    if (x.matchingRows.Count() == 0)
    {
        rowsToAdd.Add(x.input.record);
    }
    else if (x.matchingRows.Count() == 1)
    {
        if (x.matchingRows.First().bombOutTerm != x.input.bombOutTerm)
        {
            rowsToUpdate.Add(x.input.record);
            entitiesToUpdate.Add(x.matchingRows.First().entity);
        }
    }
    else
    {
        entitiesToDelete.AddRange(x.matchingRows.Select(y => y.entity));
        rowsToAdd.Add(x.input.record);
    }
}

I would suspect that this will be the among the fastest approaches presented.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM