
How to optimize the performance of comparisons for large data-sets?

I have a method that is pretty slow, and I am not sure how to optimize it. I also don't really know how LINQ works, so if the solution is to use LINQ, please explain. Thanks a lot.

The DataTable dtExcel passed to the method holds the first part of the data; the other part, _dt, comes from the database. The two nested for loops perform roughly 700 (dtExcel) * 10,000 (_dt) = 7,000,000 comparisons.

Here is the code:

public async Task<DataTable> GetAdressesFromDB(DataTable dtExcel)
{
    try
    {
        return await Task.Run(() =>
        {
            CurrentProgress = 0;
            ProgressbarDBVisible = true;

            _dtFoundDuplicates.Clear();

            _dt = new DataTable();

            _dt = DBConn.GetAllAddresses(dtExcel);

            ProgressMaximum = dtExcel.Rows.Count;

            for (int i = 0; i < dtExcel.Rows.Count; i++)
            {
                CurrentProgress++;
                for (int y = 0; y < _dt.Rows.Count; y++)
                {
                    // Criteria to check duplicates
                    string compareAdressExcel = "";
                    string compareAdressDB = "";

                    // Get the setted filter criteria and create both excel and db compare strings
                    string[] criteriaFields = ConfigurationManager.AppSettings["strFilter"].Split(',');
                    foreach (String cField in criteriaFields)
                    {
                        string cFieldTrimmed = cField.Trim();
                        if (cFieldTrimmed == "Strasse")
                        {
                            compareAdressExcel += dtExcel.Rows[i][cFieldTrimmed].ToString()
                                                 .ToLower()
                                                 .Replace(" ", "")
                                                 .Replace("str", "strasse")
                                                 .Replace("straße", "strasse")
                                                 .Replace("str.", "strasse");
                            compareAdressDB += _dt.Rows[y][cFieldTrimmed].ToString()
                                               .Replace(" ", "")
                                               .ToLower()
                                               .Replace("str", "strasse")
                                               .Replace("straße", "strasse")
                                               .Replace("str.", "strasse");
                        }
                        else
                        {
                            compareAdressExcel += dtExcel.Rows[i][cFieldTrimmed].ToString().Replace(" ", "").ToLower();
                            compareAdressDB += _dt.Rows[y][cFieldTrimmed].ToString().Replace(" ", "").ToLower();
                        }
                    }

                    // If the company doesn't exists in Database, the contact to that company found in excel
                    // automatically won't exist either. Otherwise, check if contact exists.
                    if (compareAdressExcel == compareAdressDB)
                    {
                        string strOneExistTwoNot = "2";

                        if (!string.IsNullOrWhiteSpace(dtExcel.Rows[i]["FirstName"].ToString().Trim()) && 
                            !string.IsNullOrWhiteSpace(dtExcel.Rows[i]["LastName"].ToString().Trim()))
                        {
                            strOneExistTwoNot = _crm.CheckContactExists(Convert.ToInt32(_dt.Rows[y]["AdressNummer"].ToString().Trim()),
                                                dtExcel.Rows[i]["FirstName"].ToString().Trim(), 
                                                dtExcel.Rows[i]["LastName"].ToString().Trim());
                        }

                        // Check if CheckContactExsists was successful
                        if (strOneExistTwoNot != "1" && strOneExistTwoNot != "2")
                        {
                            throw new Exception(strOneExistTwoNot);
                        }

                        // If Contact exists, mark row and add duplicate row,
                        // otherwise only add duplicate row
                        if (strOneExistTwoNot == "1")
                        {
                            dtExcel.Rows[i]["ContactExists"] = 1;
                            _dtFoundDuplicates.Rows.Add(dtExcel.Rows[i]["ID"], _dt.Rows[y]["AdressNummer"], "1");
                        }
                        else
                        {
                            _dtFoundDuplicates.Rows.Add(dtExcel.Rows[i]["ID"], _dt.Rows[y]["AdressNummer"], "0");
                        }
                        dtExcel.Rows[i]["AdressExists"] = 1;
                    }
                }
            }
            ProgressbarDBVisible = false;
            return dtExcel;
        });
    }
    catch (Exception ex)
    {
        throw ex;
    }
}

Edit:

Alright, so with the help of @dlxeon's answer I tried to normalize my data outside of my two for loops. I also tried to use a Dictionary to improve the comparison speed. What I can't do right now is normalize the database and issue single statements instead of retrieving the whole table. Thank you all for helping. Please tell me if there is still room for improvement in the code.

New code:

public async Task<DataTable> GetAdressesFromDB(DataTable dtExcel)
        {
            try
            {
                return await Task.Run(() =>
                {
                    CurrentProgress = 0;
                    ProgressbarDBVisible = true;

                    _dtFoundDuplicates.Clear();

                    _dt = DBConn.GetAllAddresses();

                    ProgressMaximum = dtExcel.Rows.Count;

                    // Normalization
                    string[] criteriaFields = ConfigurationManager.AppSettings["strFilter"].Split(',').Select(x => x.Trim()).ToArray();

                    Dictionary<int, string> excelAddresses = new Dictionary<int, string>();
                    for (int i = 0; i < dtExcel.Rows.Count; i++)
                    {
                        StringBuilder compareAdressExcel = new StringBuilder();
                        foreach (String cFieldTrimmed in criteriaFields)
                        {
                            if (cFieldTrimmed == "Strasse")
                            {
                                var normalizedValue = dtExcel.Rows[i][cFieldTrimmed].ToString()
                                    .ToLower()
                                    .Replace(" ", "")
                                    .Replace("str", "strasse")
                                    .Replace("straße", "strasse")
                                    .Replace("str.", "strasse");
                                compareAdressExcel.Append(normalizedValue);
                            }
                            else
                            {
                                compareAdressExcel.Append(dtExcel.Rows[i][cFieldTrimmed].ToString().Replace(" ", "").ToLower());
                            }
                        }
                        excelAddresses.Add(i, compareAdressExcel.ToString());
                    }
                    Dictionary<int, string> dbAddresses = new Dictionary<int, string>();
                    for (int i = 0; i < _dt.Rows.Count; i++)
                    {
                        StringBuilder compareAdressDB = new StringBuilder();
                        foreach (String cFieldTrimmed in criteriaFields)
                        {
                            if (cFieldTrimmed == "Strasse")
                            {
                                var normalizedValue = _dt.Rows[i][cFieldTrimmed].ToString()
                                    .ToLower()
                                    .Replace(" ", "")
                                    .Replace("str", "strasse")
                                    .Replace("straße", "strasse")
                                    .Replace("str.", "strasse");
                                compareAdressDB.Append(normalizedValue);
                            }
                            else
                            {
                                compareAdressDB.Append(_dt.Rows[i][cFieldTrimmed].ToString().Replace(" ", "").ToLower());
                            }
                        }
                        dbAddresses.Add(i, compareAdressDB.ToString());
                    }

                    foreach (var exAdd in excelAddresses)
                    {
                        CurrentProgress++;

                        foreach (var dbAdd in dbAddresses)
                        {
                            // If the company doesn't exists in Database, the contact to that company found in excel
                            // automatically won't exist either. Otherwise, check if contact exists.
                            if (exAdd.Value == dbAdd.Value)
                            {
                                string strOneExistTwoNot = "2";

                                if (!string.IsNullOrWhiteSpace(dtExcel.Rows[exAdd.Key]["FirstName"].ToString().Trim()) && 
                                                               !string.IsNullOrWhiteSpace(dtExcel.Rows[exAdd.Key]["LastName"].ToString().Trim()))
                                {
                                    strOneExistTwoNot = _crm.CheckContactExists(Convert.ToInt32(_dt.Rows[dbAdd.Key]["AdressNummer"].ToString().Trim()), 
                                                                                dtExcel.Rows[exAdd.Key]["FirstName"].ToString().Trim(), 
                                                                                dtExcel.Rows[exAdd.Key]["LastName"].ToString().Trim());
                                }

                                // Check if CheckContactExsists was successful
                                if (strOneExistTwoNot != "1" && strOneExistTwoNot != "2")
                                {
                                    throw new Exception(strOneExistTwoNot);
                                }

                                if (strOneExistTwoNot == "1")
                                {
                                    dtExcel.Rows[exAdd.Key]["ContactExists"] = 1;
                                    _dtFoundDuplicates.Rows.Add(dtExcel.Rows[exAdd.Key]["ID"], _dt.Rows[dbAdd.Key]["AdressNummer"], "1");
                                }
                                else
                                {
                                    _dtFoundDuplicates.Rows.Add(dtExcel.Rows[exAdd.Key]["ID"], _dt.Rows[dbAdd.Key]["AdressNummer"], "0");
                                }
                                dtExcel.Rows[exAdd.Key]["AdressExists"] = 1;
                            }
                        }
                    }
                    ProgressbarDBVisible = false;
                    return dtExcel;
                });
            }
            catch (Exception ex)
            {
                throw ex;
            }
        }

The first red flag is that you use the database only to hold data. That beast can search way faster than you can, if you let it.

For each line in your Excel file, build a corresponding search statement for your database and fire it. Let your database worry about the best way to search through 10K records.

The second red flag is that your normalization is not done on your existing data. You want to compare two streets, but you have to normalize them over and over. Why is there no database field called "NormalizedStreet" that has those methods applied once on insertion, so that you can just fire an "equals" comparison against your normalized input data?
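If such a field were added, the normalization could live in one helper that is called once per value. The following is only a sketch: the names AddressNormalizer and NormalizeStreet are made up for illustration. It also departs from the Replace chain in the question on purpose. string.Replace calls run in order, so Replace("str", "strasse") fires first and turns "hauptstraße" into "hauptstrasseaße" and "hauptstr." into "hauptstrasse.", which means the later replacements never match and the spellings do not end up equal. A single regular expression avoids that ordering problem.

```csharp
using System.Text.RegularExpressions;

public static class AddressNormalizer
{
    // Matches "str", "str.", "strasse" or "straße" in text that is already lower case.
    // The longer alternatives are tried first, so a full "strasse" is not expanded twice.
    private static readonly Regex StreetWord =
        new Regex(@"str(asse|aße|\.)?", RegexOptions.CultureInvariant);

    // Hypothetical helper: lower-cases, removes spaces and unifies the street spelling.
    public static string NormalizeStreet(string raw)
    {
        if (raw == null)
        {
            return string.Empty;
        }

        string compact = raw.ToLowerInvariant().Replace(" ", "");
        return StreetWord.Replace(compact, "strasse");
    }
}
```

Like the original chain, this still rewrites every "str" it finds (for example inside "Strand"), so it is only as precise as the rule in the question.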

So to summarize: scrap your loop-in-a-loop. You just reinvented the database. For each row of your Excel file, build a statement (or two) to find out if it exists in your database. If you want to be crafty, run them in parallel, but I doubt that you need that for a measly 700 input records.
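A per-row lookup along those lines could be sketched roughly as below. Everything schema-related is an assumption: the table name Addresses, the column NormalizedStreet and the helper AddressLookup do not appear in the question. The "@street" parameter marker works with SQL Server; other providers may expect a different syntax. The connection is expected to be open already.

```csharp
using System;
using System.Data;

public static class AddressLookup
{
    // Hypothetical table and column names; adjust to the real schema.
    public const string ExistsSql =
        "SELECT COUNT(*) FROM Addresses WHERE NormalizedStreet = @street";

    // Asks the database whether a row with this normalized street exists.
    public static bool AddressExists(IDbConnection openConnection, string normalizedStreet)
    {
        using (IDbCommand cmd = openConnection.CreateCommand())
        {
            cmd.CommandText = ExistsSql;

            // Pass the value as a parameter instead of concatenating it into the SQL text.
            IDbDataParameter street = cmd.CreateParameter();
            street.ParameterName = "@street";
            street.Value = normalizedStreet;
            cmd.Parameters.Add(street);

            return Convert.ToInt32(cmd.ExecuteScalar()) > 0;
        }
    }
}
```

With an index on the compared column, each of the 700 lookups can be answered by the database without the client scanning the table itself.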

First, nvoight is right: you should normalize the data in the database and use its power to do the searches. However, if you can't change the database, you can still improve your code.

1) Most important is to move everything that can be done once out of the loops. That means the data normalization (replacements, ToLower, etc.). Iterate through all your Excel data and database data once to build values that can be compared directly, and then run your two nested loops for the actual comparison. Your configuration also won't change while you are in the loop, so reading it can be moved out as well. Avoid extra string allocations: you can use StringBuilder to build strings instead of using +=.

Something like this for Excel (and then a similar loop for the DB):

string[] criteriaFields = ConfigurationManager.AppSettings["strFilter"].Split(',').Select(x => x.Trim()).ToArray();
List<string> excelAddresses = new List<string>();
for (int i = 0; i < dtExcel.Rows.Count; i++)
{
    StringBuilder compareAdressExcel = new StringBuilder();
    foreach (String cFieldTrimmed in criteriaFields)
    {
        if (cFieldTrimmed == "Strasse")
        {
            var normalizedValue = dtExcel.Rows[i][cFieldTrimmed].ToString()
                .ToLower()
                .Replace(" ", "")
                .Replace("str", "strasse")
                .Replace("straße", "strasse")
                .Replace("str.", "strasse");
            compareAdressExcel.Append(normalizedValue);
        }
        else
        {
            compareAdressExcel.Append(dtExcel.Rows[i][cFieldTrimmed].ToString().Replace(" ", "").ToLower());
        }
    }
    excelAddresses.Add(compareAdressExcel.ToString());
}

Then you can use the normalized values in your main loops:

for (int i = 0; i < dtExcel.Rows.Count; i++)
{
    CurrentProgress++;
    for (int y = 0; y < _dt.Rows.Count; y++)
    {
        // Criteria to check duplicates
        string compareAdressExcel = excelAddresses[i];
        string compareAdressDB = dbAddresses[y];

2) You can use Dictionaries or HashSets to speed up string searches and comparisons instead of loops.
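As a rough illustration of point 2, the database keys can be indexed once by their normalized string, after which each Excel key needs a single lookup instead of a scan over all database rows. The names DuplicateFinder and FindMatches are hypothetical. The inputs are assumed to be the normalized lists built in point 1 (excelAddresses and dbAddresses), and several database rows are allowed to share the same key.

```csharp
using System;
using System.Collections.Generic;

public static class DuplicateFinder
{
    // Returns (excelRowIndex, dbRowIndex) pairs whose normalized keys are equal.
    public static List<Tuple<int, int>> FindMatches(IList<string> excelKeys, IList<string> dbKeys)
    {
        // Build the index once: normalized key -> all DB row indexes with that key.
        var dbIndex = new Dictionary<string, List<int>>(StringComparer.Ordinal);
        for (int y = 0; y < dbKeys.Count; y++)
        {
            List<int> rows;
            if (!dbIndex.TryGetValue(dbKeys[y], out rows))
            {
                rows = new List<int>();
                dbIndex[dbKeys[y]] = rows;
            }
            rows.Add(y);
        }

        // One hash lookup per Excel row replaces the inner loop over the DB rows.
        var matches = new List<Tuple<int, int>>();
        for (int i = 0; i < excelKeys.Count; i++)
        {
            List<int> rows;
            if (dbIndex.TryGetValue(excelKeys[i], out rows))
            {
                foreach (int y in rows)
                {
                    matches.Add(Tuple.Create(i, y));
                }
            }
        }
        return matches;
    }
}
```

For the sizes in the question this means about 10,000 insertions plus 700 lookups rather than 7,000,000 string comparisons. The body of the existing if (compareAdressExcel == compareAdressDB) branch would then run once per returned pair.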

3) How fast is that call to "_crm"? Maybe that external call takes a while, and that is a reason for your slowness too.

_crm.CheckContactExists(...)
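One way to find out is to time the call with System.Diagnostics.Stopwatch. The wrapper below is a minimal sketch; CallTimer and Measure are made-up names, and the signature of CheckContactExists is known only from the question's code.

```csharp
using System;
using System.Diagnostics;

public static class CallTimer
{
    // Runs the given call, returns its result and reports how long it took.
    public static T Measure<T>(Func<T> call, out TimeSpan elapsed)
    {
        Stopwatch watch = Stopwatch.StartNew();
        T result = call();
        watch.Stop();
        elapsed = watch.Elapsed;
        return result;
    }
}
```

Wrapping the call as CallTimer.Measure(() => _crm.CheckContactExists(...), out elapsed) and summing the elapsed values across all matches would show whether the external call or the comparison loops dominate the runtime.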

If this is SQL Server, then you should be using SSIS.
It has fuzzy matching, which is pretty much a must for matching records on addresses from two different sources.
I would import the data into a table using SSIS as well and do any preprocessing of the data in the pipeline.
The whole thing could then be run as a job.

