
How to optimize the performance of comparisons for large data-sets?

I have a method that is pretty slow and I am not sure how to optimize this. I also don't really know how LINQ works, so if the solution is to use LINQ, please explain. Thanks a lot.

The DataTable dtExcel passed to the method contains the first part of the data; the other part is _dt, loaded from the database. The two for loops run through roughly the following amount of data: 700 (dtExcel) * 10,000 (_dt) = 7,000,000 comparisons.

Here is the code:

public async Task<DataTable> GetAdressesFromDB(DataTable dtExcel)
{
    try
    {
        return await Task.Run(() =>
        {
            CurrentProgress = 0;
            ProgressbarDBVisible = true;

            _dtFoundDuplicates.Clear();

            _dt = new DataTable();

            _dt = DBConn.GetAllAddresses(dtExcel);

            ProgressMaximum = dtExcel.Rows.Count;

            for (int i = 0; i < dtExcel.Rows.Count; i++)
            {
                CurrentProgress++;
                for (int y = 0; y < _dt.Rows.Count; y++)
                {
                    // Criteria to check duplicates
                    string compareAdressExcel = "";
                    string compareAdressDB = "";

                    // Get the configured filter criteria and build both the Excel and DB compare strings
                    string[] criteriaFields = ConfigurationManager.AppSettings["strFilter"].Split(',');
                    foreach (String cField in criteriaFields)
                    {
                        string cFieldTrimmed = cField.Trim();
                        if (cFieldTrimmed == "Strasse")
                        {
                            compareAdressExcel += dtExcel.Rows[i][cFieldTrimmed].ToString()
                                                 .ToLower()
                                                 .Replace(" ", "")
                                                 .Replace("str", "strasse")
                                                 .Replace("straße", "strasse")
                                                 .Replace("str.", "strasse");
                            compareAdressDB += _dt.Rows[y][cFieldTrimmed].ToString()
                                               .Replace(" ", "")
                                               .ToLower()
                                               .Replace("str", "strasse")
                                               .Replace("straße", "strasse")
                                               .Replace("str.", "strasse");
                        }
                        else
                        {
                            compareAdressExcel += dtExcel.Rows[i][cFieldTrimmed].ToString().Replace(" ", "").ToLower();
                            compareAdressDB += _dt.Rows[y][cFieldTrimmed].ToString().Replace(" ", "").ToLower();
                        }
                    }

                    // If the company doesn't exist in the database, the contact for that company found in Excel
                    // automatically won't exist either. Otherwise, check whether the contact exists.
                    if (compareAdressExcel == compareAdressDB)
                    {
                        string strOneExistTwoNot = "2";

                        if (!string.IsNullOrWhiteSpace(dtExcel.Rows[i]["FirstName"].ToString().Trim()) && 
                            !string.IsNullOrWhiteSpace(dtExcel.Rows[i]["LastName"].ToString().Trim()))
                        {
                            strOneExistTwoNot = _crm.CheckContactExists(Convert.ToInt32(_dt.Rows[y]["AdressNummer"].ToString().Trim()),
                                                dtExcel.Rows[i]["FirstName"].ToString().Trim(), 
                                                dtExcel.Rows[i]["LastName"].ToString().Trim());
                        }

                        // Check if CheckContactExists was successful
                        if (strOneExistTwoNot != "1" && strOneExistTwoNot != "2")
                        {
                            throw new Exception(strOneExistTwoNot);
                        }

                        // If Contact exists, mark row and add duplicate row,
                        // otherwise only add duplicate row
                        if (strOneExistTwoNot == "1")
                        {
                            dtExcel.Rows[i]["ContactExists"] = 1;
                            _dtFoundDuplicates.Rows.Add(dtExcel.Rows[i]["ID"], _dt.Rows[y]["AdressNummer"], "1");
                        }
                        else
                        {
                            _dtFoundDuplicates.Rows.Add(dtExcel.Rows[i]["ID"], _dt.Rows[y]["AdressNummer"], "0");
                        }
                        dtExcel.Rows[i]["AdressExists"] = 1;
                    }
                }
            }
            ProgressbarDBVisible = false;
            return dtExcel;
        });
    }
    catch (Exception)
    {
        // Rethrow without resetting the stack trace ("throw ex;" would reset it)
        throw;
    }
}

Edit:

Alright, so with the help of @dlxeon's answer I tried to normalize my data outside of my two for loops. I also tried to use a Dictionary to improve the comparison speed. What I can't do right now is normalize the database and issue single statements instead of retrieving the whole table. Thank you all for helping. Please tell me if there is still room for improvement in the code.

New code:

public async Task<DataTable> GetAdressesFromDB(DataTable dtExcel)
        {
            try
            {
                return await Task.Run(() =>
                {
                    CurrentProgress = 0;
                    ProgressbarDBVisible = true;

                    _dtFoundDuplicates.Clear();

                    _dt = DBConn.GetAllAddresses();

                    ProgressMaximum = dtExcel.Rows.Count;

                    // Normalization
                    string[] criteriaFields = ConfigurationManager.AppSettings["strFilter"].Split(',').Select(x => x.Trim()).ToArray();

                    Dictionary<int, string> excelAddresses = new Dictionary<int, string>();
                    for (int i = 0; i < dtExcel.Rows.Count; i++)
                    {
                        StringBuilder compareAdressExcel = new StringBuilder();
                        foreach (String cFieldTrimmed in criteriaFields)
                        {
                            if (cFieldTrimmed == "Strasse")
                            {
                                var normalizedValue = dtExcel.Rows[i][cFieldTrimmed].ToString()
                                    .ToLower()
                                    .Replace(" ", "")
                                    .Replace("str", "strasse")
                                    .Replace("straße", "strasse")
                                    .Replace("str.", "strasse");
                                compareAdressExcel.Append(normalizedValue);
                            }
                            else
                            {
                                compareAdressExcel.Append(dtExcel.Rows[i][cFieldTrimmed].ToString().Replace(" ", "").ToLower());
                            }
                        }
                        excelAddresses.Add(i, compareAdressExcel.ToString());
                    }
                    Dictionary<int, string> dbAddresses = new Dictionary<int, string>();
                    for (int i = 0; i < _dt.Rows.Count; i++)
                    {
                        StringBuilder compareAdressDB = new StringBuilder();
                        foreach (String cFieldTrimmed in criteriaFields)
                        {
                            if (cFieldTrimmed == "Strasse")
                            {
                                var normalizedValue = _dt.Rows[i][cFieldTrimmed].ToString()
                                    .ToLower()
                                    .Replace(" ", "")
                                    .Replace("str", "strasse")
                                    .Replace("straße", "strasse")
                                    .Replace("str.", "strasse");
                                compareAdressDB.Append(normalizedValue);
                            }
                            else
                            {
                                compareAdressDB.Append(_dt.Rows[i][cFieldTrimmed].ToString().Replace(" ", "").ToLower());
                            }
                        }
                        dbAddresses.Add(i, compareAdressDB.ToString());
                    }

                    foreach (var exAdd in excelAddresses)
                    {
                        CurrentProgress++;

                        foreach (var dbAdd in dbAddresses)
                        {
                            // If the company doesn't exist in the database, the contact for that company found in Excel
                            // automatically won't exist either. Otherwise, check whether the contact exists.
                            if (exAdd.Value == dbAdd.Value)
                            {
                                string strOneExistTwoNot = "2";

                                if (!string.IsNullOrWhiteSpace(dtExcel.Rows[exAdd.Key]["FirstName"].ToString().Trim()) && 
                                                               !string.IsNullOrWhiteSpace(dtExcel.Rows[exAdd.Key]["LastName"].ToString().Trim()))
                                {
                                    strOneExistTwoNot = _crm.CheckContactExists(Convert.ToInt32(_dt.Rows[dbAdd.Key]["AdressNummer"].ToString().Trim()), 
                                                                                dtExcel.Rows[exAdd.Key]["FirstName"].ToString().Trim(), 
                                                                                dtExcel.Rows[exAdd.Key]["LastName"].ToString().Trim());
                                }

                                // Check if CheckContactExists was successful
                                if (strOneExistTwoNot != "1" && strOneExistTwoNot != "2")
                                {
                                    throw new Exception(strOneExistTwoNot);
                                }

                                if (strOneExistTwoNot == "1")
                                {
                                    dtExcel.Rows[exAdd.Key]["ContactExists"] = 1;
                                    _dtFoundDuplicates.Rows.Add(dtExcel.Rows[exAdd.Key]["ID"], _dt.Rows[dbAdd.Key]["AdressNummer"], "1");
                                }
                                else
                                {
                                    _dtFoundDuplicates.Rows.Add(dtExcel.Rows[exAdd.Key]["ID"], _dt.Rows[dbAdd.Key]["AdressNummer"], "0");
                                }
                                dtExcel.Rows[exAdd.Key]["AdressExists"] = 1;
                            }
                        }
                    }
                    ProgressbarDBVisible = false;
                    return dtExcel;
                });
            }
            catch (Exception)
            {
                // Rethrow without resetting the stack trace ("throw ex;" would reset it)
                throw;
            }
        }

The first red flag is that you use the database only to hold data. That beast can search way faster than you can, if you let it.

For each line in your Excel data, build a corresponding search statement for your database and fire it. Let your database worry about the best way to search through 10K records.
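As a rough sketch of what "build a search statement" could mean here, the helper below only produces the parameterized SQL text for one lookup. The table name `Addresses` and the field `PLZ` in the usage note are hypothetical; `AdressNummer` and `Strasse` come from the question. Column names cannot be bound as parameters, so the sketch validates them before they go into the SQL text. The values from the Excel row would be bound to `@p0`, `@p1`, and so on.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class AddressLookupSql
{
    // Only plain identifiers are allowed as table or column names,
    // because they are concatenated into the SQL text.
    private static readonly Regex Identifier = new Regex("^[A-Za-z_][A-Za-z0-9_]*$");

    // Builds "SELECT AdressNummer FROM <table> WHERE f0 = @p0 AND f1 = @p1 ...".
    public static string Build(string table, IReadOnlyList<string> criteriaFields)
    {
        if (criteriaFields.Count == 0)
            throw new ArgumentException("At least one criteria field is required.");
        if (!Identifier.IsMatch(table) || !criteriaFields.All(f => Identifier.IsMatch(f)))
            throw new ArgumentException("Table and column names must be plain identifiers.");

        // One equality condition per configured field, each with its own parameter.
        IEnumerable<string> conditions = criteriaFields.Select((field, i) => field + " = @p" + i);
        return "SELECT AdressNummer FROM " + table + " WHERE " + string.Join(" AND ", conditions);
    }
}
```

For example, `AddressLookupSql.Build("Addresses", new[] { "Strasse", "PLZ" })` gives a statement with two parameters. With `SqlCommand`, the values could then be bound via `cmd.Parameters.AddWithValue("@p0", value)`. An exact match like this only finds rows whose stored values are already in the same normalized form as the input.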

The second red flag is that your normalization is not done on your existing data. You want to compare two streets, but you have to normalize them over and over. Why is there no database field called "NormalizedStreet" that already has those methods applied once on insertion, so that you can just fire an "equals" comparison against your normalized input data?
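A minimal sketch of a normalization function that could be applied once per value, whether on insertion or on the Excel input. The class and method names are made up for illustration. It is not a drop-in copy of the Replace chain from the question: in that chain, Replace("str", "strasse") runs first and also hits the "str" inside "strasse" and "straße", so "Hauptstr.", "Hauptstraße" and "Hauptstrasse" end up as three different strings. The sketch swaps in a single regular expression that tries the longer spellings first.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StreetNormalizer
{
    // Alternatives are tried left to right, so the full spellings are consumed
    // as a whole before the bare "str" abbreviation can match.
    private static readonly Regex StreetToken =
        new Regex(@"straße|strasse|str\.|str", RegexOptions.CultureInvariant);

    // Lower-cases, strips spaces, and maps every street spelling to "strasse".
    public static string Normalize(string raw)
    {
        string compact = (raw ?? string.Empty).ToLowerInvariant().Replace(" ", "");
        return StreetToken.Replace(compact, "strasse");
    }
}
```

With this, `StreetNormalizer.Normalize("Hauptstr. 5")`, `Normalize("Hauptstraße 5")` and `Normalize("Hauptstrasse 5")` all produce the same string. Like the original chain, it also rewrites "str" inside unrelated words (a name such as "Astrid", for instance), which is harmless for equality checks as long as both sides go through the same function.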

So to summarize: scrap your loop-in-a-loop. You just reinvented the database. For each row of your Excel data, build a statement (or two) to find out if it exists in your database. If you want to be crafty, run them in parallel, but I doubt that you need that for a measly 700 input records.

First, nvoight is right: you should normalize the data in the database and use its power to do the searches. However, if you can't change the database, you can still make improvements in your code.

1) Most important is to move everything that can be done once out of the loops. That means the data normalization (replacements, ToLower, etc.). Iterate through all your Excel data and all your database data once to build values that can be compared directly, and only then run your two nested loops for the actual comparison. Your configuration also won't change while you are in the loop, so reading it can be moved out as well. Avoid extra string allocations: you can use StringBuilder to build strings instead of +=.

Something like this for Excel (with a similar loop for the DB):

string[] criteriaFields = ConfigurationManager.AppSettings["strFilter"].Split(',').Select(x => x.Trim()).ToArray();
List<string> excelAddresses = new List<string>();
for (int i = 0; i < dtExcel.Rows.Count; i++)
{
    StringBuilder compareAdressExcel = new StringBuilder();
    foreach (String cFieldTrimmed in criteriaFields)
    {
        if (cFieldTrimmed == "Strasse")
        {
            var normalizedValue = dtExcel.Rows[i][cFieldTrimmed].ToString()
                .ToLower()
                .Replace(" ", "")
                .Replace("str", "strasse")
                .Replace("straße", "strasse")
                .Replace("str.", "strasse");
            compareAdressExcel.Append(normalizedValue);
        }
        else
        {
            compareAdressExcel.Append(dtExcel.Rows[i][cFieldTrimmed].ToString().Replace(" ", "").ToLower());
        }
    }
    excelAddresses.Add(compareAdressExcel.ToString());
}

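Then you can use the normalized values in your main loops:

for (int i = 0; i < dtExcel.Rows.Count; i++)
{
    CurrentProgress++;
    for (int y = 0; y < _dt.Rows.Count; y++)
    {
        // Criteria to check duplicates (already normalized, so no string work per iteration).
        // dbAddresses is built by the analogous loop over _dt.
        string compareAdressExcel = excelAddresses[i];
        string compareAdressDB = dbAddresses[y];

        // ... the rest of the duplicate check stays as in the question
    }
}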

2) You can use Dictionaries or HashSets to speed up string searches and comparisons instead of looping.

3) How fast is that call to "_crm"? Maybe that external call takes a while and is also a reason for the slowness.

_crm.CheckContactExists(...)
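Point 2 could look roughly like the sketch below, assuming both sides have already been normalized into lists of strings (excelAddresses and dbAddresses from point 1). The class and method names are made up for illustration. The important part is that the dictionary is keyed by the normalized string, not by the row index: keyed by row index, every Excel row still has to be compared with every database row.

```csharp
using System;
using System.Collections.Generic;

public static class DuplicateFinder
{
    // Returns all (Excel row index, DB row index) pairs whose normalized keys are equal.
    public static List<(int ExcelIndex, int DbIndex)> FindMatches(
        IReadOnlyList<string> excelKeys, IReadOnlyList<string> dbKeys)
    {
        // Index the DB side once: normalized key -> indices of all DB rows with that key.
        var dbIndex = new Dictionary<string, List<int>>(StringComparer.Ordinal);
        for (int y = 0; y < dbKeys.Count; y++)
        {
            if (!dbIndex.TryGetValue(dbKeys[y], out List<int> rows))
            {
                rows = new List<int>();
                dbIndex[dbKeys[y]] = rows;
            }
            rows.Add(y);
        }

        var matches = new List<(int ExcelIndex, int DbIndex)>();
        for (int i = 0; i < excelKeys.Count; i++)
        {
            // A single hash lookup replaces the scan over every DB row.
            if (dbIndex.TryGetValue(excelKeys[i], out List<int> rows))
            {
                foreach (int y in rows)
                {
                    matches.Add((i, y));
                }
            }
        }
        return matches;
    }
}
```

The body of the original `if (compareAdressExcel == compareAdressDB)` block would then run once per returned pair, using dtExcel.Rows[ExcelIndex] and _dt.Rows[DbIndex]. For 700 and 10,000 rows that is about 10,700 dictionary operations plus one step per actual match, instead of 7,000,000 string comparisons.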

If this is SQL Server, then you should be using SSIS.
It has fuzzy matching, which is pretty much a must for matching address records from two different sources.
I would also import the data into a table using SSIS and do any preprocessing of the data in the pipeline.
The whole thing could then be run as a job.
