How to speed up LINQ WHERE?

I ran a profiler on my .NET WinForms app (compiled against .NET 4.7.1), and it points at the following function as consuming 73% of my application's CPU time, which seems like far too much for a simple utility function:

public static bool DoesRecordExist(string keyColumn1, string keyColumn2, string keyColumn3,
        string keyValue1, string keyValue2, string keyValue3, DataTable dt)
{
    if (dt != null && dt.Rows.Count > 0) {
        bool exists = dt.AsEnumerable()
            .Where(r =>
                string.Equals(SafeTrim(r[keyColumn1]), keyValue1, StringComparison.CurrentCultureIgnoreCase) &&
                string.Equals(SafeTrim(r[keyColumn2]), keyValue2, StringComparison.CurrentCultureIgnoreCase) &&
                string.Equals(SafeTrim(r[keyColumn3]), keyValue3, StringComparison.CurrentCultureIgnoreCase)
            )
            .Any();
        return exists;
    } else {
        return false;
    }
}

The purpose of this function is to take some key column names and matching key values and check whether any matching record exists in the in-memory C# DataTable.

My app is processing hundreds of thousands of records, and for each record this function must be called multiple times. The app is doing a lot of inserts, and before any insert it must check whether that record already exists in the database. I figured that an in-memory check against the DataTable would be much faster than going back to the physical database each time, which is why I'm doing this in-memory check. Each time I do a database insert, I do a corresponding insert into the DataTable, so that subsequent checks as to whether the record exists will be accurate.

So to my question: Is there a faster approach? (I don't think I can avoid checking for record existence each and every time, or else I'll end up with duplicate inserts and key violations.)

EDIT #1: In addition to trying the suggestions that have been coming in, which I'm doing now, it occurred to me that I should also maybe call .AsEnumerable() only once and pass in the EnumerableRowCollection<DataRow> instead of the DataTable. Do you think this will help?

EDIT #2: I just did a controlled test and found that querying the database directly to see if a record already exists is dramatically slower than doing an in-memory lookup.

Your solution filters the rows with Where and then asks whether any are left. Use Any directly instead: replace Where with Any and pass it the condition. It stops processing at the first row for which the condition evaluates to true. (Where(...).Any() is lazily evaluated too, so expect a small gain from one less iterator rather than a change in complexity.)

bool exists = dt.AsEnumerable().Any(r => condition);
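
To check how many rows each form actually inspects, here is a small sketch that counts predicate calls. The three-row table, the column name Code and the class name AnyVersusWhereDemo are assumptions made up for illustration; they are not from the question. Since LINQ's Where is lazily evaluated, both forms should report the same count.

```csharp
using System;
using System.Data;
using System.Linq;

public static class AnyVersusWhereDemo
{
    // Builds a small hypothetical table with a single string column.
    public static DataTable BuildTable()
    {
        var dt = new DataTable();
        dt.Columns.Add("Code", typeof(string));
        dt.Rows.Add("A");
        dt.Rows.Add("B");
        dt.Rows.Add("C");
        return dt;
    }

    // Number of rows the predicate ran on when using Where(...).Any().
    public static int CountCallsWhereAny(DataTable dt, string value)
    {
        int calls = 0;
        dt.AsEnumerable().Where(r => { calls++; return (string)r["Code"] == value; }).Any();
        return calls;
    }

    // Number of rows the predicate ran on when using Any(predicate).
    public static int CountCallsAny(DataTable dt, string value)
    {
        int calls = 0;
        dt.AsEnumerable().Any(r => { calls++; return (string)r["Code"] == value; });
        return calls;
    }
}
```

If the counts match on your data as well, the remaining cost is the per-row comparison itself, which is what the hash-based answers address.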

You should try parallel execution. This should be a very good case for it, since you mentioned you are working with a huge set, and no ordering is needed if you just want to check whether a record already exists.

bool exists = dt.AsEnumerable().AsParallel().Any(r =>
            string.Equals(SafeTrim(r[keyColumn1]), keyValue1, StringComparison.CurrentCultureIgnoreCase) &&
            string.Equals(SafeTrim(r[keyColumn2]), keyValue2, StringComparison.CurrentCultureIgnoreCase) &&
            string.Equals(SafeTrim(r[keyColumn3]), keyValue3, StringComparison.CurrentCultureIgnoreCase)
        );

It might be that you want to transpose your data structure. Instead of having a DataTable where each row has keyColumn1, keyColumn2 and keyColumn3, have 3 HashSet<string>, where the first contains all of the keyColumn1 values, etc.

Doing this should be a lot faster than iterating through each of the rows:

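// AsEnumerable() is needed because DataRowCollection is not IEnumerable<DataRow>;
// SafeTrim (the helper from the question) turns the object cell value into a string.
var hashSetColumn1 = new HashSet<string>(
    dt.AsEnumerable().Select(x => SafeTrim(x[keyColumn1])),
    StringComparer.CurrentCultureIgnoreCase);

var hashSetColumn2 = new HashSet<string>(
    dt.AsEnumerable().Select(x => SafeTrim(x[keyColumn2])),
    StringComparer.CurrentCultureIgnoreCase);

var hashSetColumn3 = new HashSet<string>(
    dt.AsEnumerable().Select(x => SafeTrim(x[keyColumn3])),
    StringComparer.CurrentCultureIgnoreCase);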

Obviously, create these once, and then maintain them (as you're currently maintaining your DataTable). They're expensive to create, but cheap to query.

Then:

bool exists = hashSetColumn1.Contains(keyValue1) &&
    hashSetColumn2.Contains(keyValue2) &&
    hashSetColumn3.Contains(keyValue3);

Alternatively (and more cleanly), you can define your own struct which contains the values from the 3 columns, and use a single HashSet. Unlike three separate sets, this also guarantees that the three values come from the same row:

public struct Row : IEquatable<Row>
{
    // Convenience
    private static readonly IEqualityComparer<string> comparer = StringComparer.CurrentCultureIgnoreCase;

    public string Value1 { get; }
    public string Value2 { get; }
    public string Value3 { get; }

    public Row(string value1, string value2, string value3)
    {
        Value1 = value1;
        Value2 = value2;
        Value3 = value3;
    }

    public override bool Equals(object obj) => obj is Row row && Equals(row);

    public bool Equals(Row other)
    {
        return comparer.Equals(Value1, other.Value1) &&
               comparer.Equals(Value2, other.Value2) &&
               comparer.Equals(Value3, other.Value3);
    }

    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 23 + comparer.GetHashCode(Value1);
            hash = hash * 23 + comparer.GetHashCode(Value2);
            hash = hash * 23 + comparer.GetHashCode(Value3);
            return hash;
        }
    }

    public static bool operator ==(Row left, Row right) => left.Equals(right);
    public static bool operator !=(Row left, Row right) => !(left == right);
}

Then you can build the set:

var hashSet = new HashSet<Row>(dt.AsEnumerable().Select(x =>
    new Row(SafeTrim(x[keyColumn1]), SafeTrim(x[keyColumn2]), SafeTrim(x[keyColumn3]))));

And cache that. Query it like this:

hashSet.Contains(new Row(keyValue1, keyValue2, keyValue3));

I suggest keeping the key columns of the existing records in a HashSet. I'm using tuples here, but you could also create your own Key struct or class by overriding GetHashCode and Equals.

private HashSet<(string, string, string)> _existingKeys =
    new HashSet<(string, string, string)>();

Then you can test for the existence of a key very quickly with

if (_existingKeys.Contains((keyValue1, keyValue2, keyValue3))) {
    ...
}

Don't forget to keep this HashSet in sync with your additions and deletions. Note that the tuples' built-in equality cannot compare with CurrentCultureIgnoreCase. Therefore either convert all the keys to lower case, or use the custom struct approach, where you can use the desired comparison method.

public readonly struct Key
{
    public Key(string key1, string key2, string key3) : this()
    {
        Key1 = key1?.Trim() ?? "";
        Key2 = key2?.Trim() ?? "";
        Key3 = key3?.Trim() ?? "";
    }

    public string Key1 { get; }
    public string Key2 { get; }
    public string Key3 { get; }

    public override bool Equals(object obj)
    {
        if (!(obj is Key)) {
            return false;
        }

        var key = (Key)obj;
        return
            String.Equals(Key1, key.Key1, StringComparison.CurrentCultureIgnoreCase) &&
            String.Equals(Key2, key.Key2, StringComparison.CurrentCultureIgnoreCase) &&
            String.Equals(Key3, key.Key3, StringComparison.CurrentCultureIgnoreCase);
    }

    public override int GetHashCode()
    {
        int hashCode = -2131266610;
        unchecked {
            hashCode = hashCode * -1521134295 + StringComparer.CurrentCultureIgnoreCase.GetHashCode(Key1);
            hashCode = hashCode * -1521134295 + StringComparer.CurrentCultureIgnoreCase.GetHashCode(Key2);
            hashCode = hashCode * -1521134295 + StringComparer.CurrentCultureIgnoreCase.GetHashCode(Key3);
        }
        return hashCode;
    }
}
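
For the other option mentioned above, normalizing the keys before they go into the tuple set, a minimal sketch might look like this. The class name ExistingKeyCache and its methods are hypothetical, and trimming plus ToLowerInvariant is only an approximation of a CurrentCultureIgnoreCase comparison, so treat it as a starting point rather than a drop-in replacement.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper around the tuple-based HashSet; all names are illustrative.
public sealed class ExistingKeyCache
{
    private readonly HashSet<(string, string, string)> _existingKeys =
        new HashSet<(string, string, string)>();

    // Trim and lower-case so that the tuple's default (ordinal) equality
    // behaves case-insensitively. Null is treated as an empty string.
    private static string Normalize(string s) => (s ?? "").Trim().ToLowerInvariant();

    private static (string, string, string) MakeKey(string k1, string k2, string k3) =>
        (Normalize(k1), Normalize(k2), Normalize(k3));

    // True when a record with these three key values has been registered.
    public bool Contains(string k1, string k2, string k3) =>
        _existingKeys.Contains(MakeKey(k1, k2, k3));

    // HashSet<T>.Add returns false when the key is already present, so the
    // existence check and the bookkeeping take a single lookup.
    public bool TryAdd(string k1, string k2, string k3) =>
        _existingKeys.Add(MakeKey(k1, k2, k3));
}
```

With this shape, the insert path can call TryAdd once and only go to the database when it returns true, instead of checking first and adding afterwards.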

Another question is whether it is a good idea to use the current culture when comparing db keys. Users with different cultures might get different results. It is better to explicitly specify the same culture that the db uses.
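
One way to pin the culture is to build the comparer from a named culture instead of relying on the current thread's culture. This is a sketch only: the class name KeyComparers is hypothetical, and "en-US" in the usage note is a placeholder for whatever culture matches your database collation.

```csharp
using System;
using System.Globalization;

public static class KeyComparers
{
    // Builds a case-insensitive comparer bound to one named culture rather
    // than to whatever culture the current thread happens to run under.
    public static StringComparer ForCulture(string cultureName) =>
        StringComparer.Create(CultureInfo.GetCultureInfo(cultureName), true);
}
```

Such a comparer can be passed to a set, for example new HashSet<string>(KeyComparers.ForCulture("en-US")), or used inside the Equals and GetHashCode methods of a key struct in place of StringComparer.CurrentCultureIgnoreCase.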

In some cases LINQ won't be optimized as well as a plain loop, so you might be better off writing the query the old-fashioned way:

public static bool DoesRecordExist(string keyColumn1, string keyColumn2, string keyColumn3,
        string keyValue1, string keyValue2, string keyValue3, DataTable dt)
{
    if (dt != null) 
    {
        foreach (DataRow r in dt.Rows)
        {
            if (string.Equals(SafeTrim(r[keyColumn1]), keyValue1, StringComparison.CurrentCultureIgnoreCase) &&
                string.Equals(SafeTrim(r[keyColumn2]), keyValue2, StringComparison.CurrentCultureIgnoreCase) &&
                string.Equals(SafeTrim(r[keyColumn3]), keyValue3, StringComparison.CurrentCultureIgnoreCase))
            {
                return true;
            }
        }
    }
    return false;
}

There might be more structural improvements, but whether you can use them depends on your situation.

Option 1: Make the selection in the database

You are using a DataTable, so there is a chance that you fetch the data from a database. If you have a lot of records, it might make more sense to move this check to the database. With the proper indexes it might be way faster than an in-memory table scan.

Option 2: Replace string.Equals+SafeTrim with a custom method

You are using SafeTrim up to three times per row, which creates a lot of new strings. If you write your own method that compares both strings (string.Equals) while ignoring leading/trailing whitespace (SafeTrim), but without creating a new string, this could be way faster, reduce memory load and reduce garbage collection. If the implementation is small enough to be inlined, you may gain a lot of performance.
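
A rough sketch of such a method follows. It assumes SafeTrim treats null as an empty string and trims the same characters as string.Trim(); neither is confirmed by the question. The names TrimmedComparison, TrimmedEquals and Bounds are hypothetical. It locates the trimmed bounds by index and hands them to CompareInfo.Compare, so no trimmed copies are created.

```csharp
using System;
using System.Globalization;

public static class TrimmedComparison
{
    // Compares a and b as if both had been trimmed, ignoring case under the
    // current culture, without building trimmed copies. Null counts as empty
    // (an assumption about what SafeTrim does).
    public static bool TrimmedEquals(string a, string b)
    {
        a = a ?? "";
        b = b ?? "";
        Bounds(a, out int startA, out int lengthA);
        Bounds(b, out int startB, out int lengthB);
        return CultureInfo.CurrentCulture.CompareInfo.Compare(
            a, startA, lengthA, b, startB, lengthB, CompareOptions.IgnoreCase) == 0;
    }

    // Finds the offset and length of s without leading/trailing white space.
    private static void Bounds(string s, out int start, out int length)
    {
        start = 0;
        int end = s.Length;
        while (start < end && char.IsWhiteSpace(s[start])) start++;
        while (end > start && char.IsWhiteSpace(s[end - 1])) end--;
        length = end - start;
    }
}
```

Inside the row predicate this would be called as TrimmedComparison.TrimmedEquals(r[keyColumn1] as string, keyValue1), assuming the key columns hold strings. Whether it actually beats SafeTrim plus string.Equals is something to confirm with the profiler.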

Option 3: Check the columns in the proper order

Make sure you use the proper order and specify the column that has the least probability to match as keyColumn1. This will make the if-statement evaluate to false sooner. If keyColumn1 matches in 80% of the cases, then you need to perform a lot more comparisons.
