How do I speed up a recursive search function?

Question

I am having trouble with the speed of a search function that I wrote. The function's steps are described below:

The function takes two table-name parameters: a starting point and a target.
It traverses a list of table-column combinations (50,000 entries long) and retrieves all the combinations associated with the starting-point table.
It then loops through each of the retrieved combinations and, for each one, traverses the table-column list again, this time looking for other tables that have a matching column.
Finally, it loops through each combination retrieved in the previous step and checks whether its table is the target table; if so, it saves the match, and if not, it calls itself, passing in the table name from that combination.

The function's aim is to trace a link between two tables, whether the link is direct or has multiple degrees of separation. The maximum recursion depth is a fixed integer value.

My problem is that whenever I run this function with a search depth of two (I wouldn't dare try deeper at this stage), the job either runs out of memory or I lose patience. On one occasion I waited 17 minutes before the job ran out of memory.

The average number of columns per table is 28 and the standard deviation is 34.

Here is a diagram showing examples of the various links that can be made between tables:

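Each column can match in multiple tables. Each matching table can then be searched column by column for further tables with matching columns, and so on.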

Here is my code:

private void FindLinkingTables(List<TableColumns> sourceList, TableSearchNode parentNode, string targetTable, int maxSearchDepth)
{
    if (parentNode.Level < maxSearchDepth)
    {
        IEnumerable<string> tableColumns = sourceList.Where(x => x.Table.Equals(parentNode.Table)).Select(x => x.Column);

        foreach (string sourceColumn in tableColumns)
        {
            string shortName = sourceColumn.Substring(1);

            IEnumerable<TableSearchNode> tables = sourceList.Where(
                x => x.Column.Substring(1).Equals(shortName) && !x.Table.Equals(parentNode.Table) && !parentNode.Ancenstory.Contains(x.Table)).Select(
                    x => new TableSearchNode { Table = x.Table, Column = x.Column, Level = parentNode.Level + 1 });
            foreach (TableSearchNode table in tables)
            {
                parentNode.AddChildNode(sourceColumn, table);
                if (!table.Table.Equals(targetTable))
                {
                    FindLinkingTables(sourceList, table, targetTable, maxSearchDepth);
                }
                else
                {
                    table.NotifySeachResult(true);
                }
            }
        }
    }
}

EDIT: separated out TableSearchNode logic and added property and method for completeness

//TableSearchNode
public Dictionary<string, List<TableSearchNode>> Children { get; private set; }

//TableSearchNode
public List<string> Ancenstory
{
    get
    {
        Stack<string> ancestory = new Stack<string>();
        TableSearchNode ancestor = ParentNode;
        while (ancestor != null)
        {
            ancestory.Push(ancestor.tbl);
            ancestor = ancestor.ParentNode;
        }
        return ancestory.ToList();
    }
}

//TableSearchNode
public void AddChildNode(string referenceColumn, TableSearchNode childNode)
{
    childNode.ParentNode = this;
    List<TableSearchNode> relatedTables = null;
    Children.TryGetValue(referenceColumn, out relatedTables);
    if (relatedTables == null)
    {
        relatedTables = new List<TableSearchNode>();
        Children.Add(referenceColumn, relatedTables);
    }
    relatedTables.Add(childNode);
}

Thanks in advance for your help!

Answer 1

You are really wasting a lot of memory. Here is what immediately comes to mind:

First of all, replace the incoming List<TableColumns> sourceList with an ILookup<string, TableColumns>. Build it once, before calling FindLinkingTables:

ILookup<string, TableColumns> sourceLookup = sourceList.ToLookup(s => s.Table);
FindLinkingTables(sourceLookup, parentNode, targetTable, maxSearchDepth);

Do not call .ToList() unless you really need it. For example, if you are only going to enumerate the resulting sequence once, you do not need it. Your main function will then look like this:

private void FindLinkingTables(ILookup<string, TableColumns> sourceLookup, TableSearchNode parentNode, string targetTable, int maxSearchDepth)
{
    if (parentNode.Level < maxSearchDepth)
    {
        var tableColumns = sourceLookup[parentNode.Table].Select(x => x.Column);

        foreach (string sourceColumn in tableColumns)
        {
            string shortName = sourceColumn.Substring(1);

            var tables = sourceLookup
                .Where(group => !group.Key.Equals(parentNode.Table) && !parentNode.Ancenstory.Contains(group.Key))
                .SelectMany(group => group)
                .Where(tableColumn => tableColumn.Column.Substring(1).Equals(shortName))
                .Select(x => new TableSearchNode { Table = x.Table, Column = x.Column, Level = parentNode.Level + 1 });

            foreach (TableSearchNode table in tables)
            {
                parentNode.AddChildNode(sourceColumn, table);
                if (!table.Table.Equals(targetTable))
                {
                    FindLinkingTables(sourceLookup, table, targetTable, maxSearchDepth);
                }
                else
                {
                    table.NotifySeachResult(true);
                }
            }
        }
    }
}

[Edit]

Also, to speed up the remaining complex LINQ query, you can prepare a second ILookup, keyed by the short column name:

ILookup<string, TableColumns> sourceColumnLookup = sourceList.ToLookup(t => t.Column.Substring(1));

//...

private void FindLinkingTables(
    ILookup<string, TableColumns> sourceLookup,
    ILookup<string, TableColumns> sourceColumnLookup,
    TableSearchNode parentNode, string targetTable, int maxSearchDepth)
{
    if (parentNode.Level >= maxSearchDepth)
        return;

    var tableColumns = sourceLookup[parentNode.Table].Select(x => x.Column);

    foreach (string sourceColumn in tableColumns)
    {
        string shortName = sourceColumn.Substring(1);

        // Only rows that share the short column name are examined here
        var tables = sourceColumnLookup[shortName]
            .Where(tableColumn => !tableColumn.Table.Equals(parentNode.Table)
                                  && !parentNode.AncenstoryReversed.Contains(tableColumn.Table))
            .Select(x => new TableSearchNode { Table = x.Table, Column = x.Column, Level = parentNode.Level + 1 });

        foreach (TableSearchNode table in tables)
        {
            parentNode.AddChildNode(sourceColumn, table);
            if (!table.Table.Equals(targetTable))
            {
                FindLinkingTables(sourceLookup, sourceColumnLookup, table, targetTable, maxSearchDepth);
            }
            else
            {
                table.NotifySeachResult(true);
            }
        }
    }
}

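I've also looked at your Ancenstory property. If an IEnumerable<string> is enough for your needs, consider this implementation, which walks the parent chain lazily instead of building a Stack and a List on every call: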

public IEnumerable<string> AncenstoryEnum
{
    get { return AncenstoryReversed.Reverse(); }
}

public IEnumerable<string> AncenstoryReversed
{
    get
    {
        TableSearchNode ancestor = ParentNode;
        while (ancestor != null)
        {
            yield return ancestor.tbl;
            ancestor = ancestor.ParentNode;
        }
    }
}
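To see what the two lookups buy you, here is a small self-contained sketch. The TableColumns class below is a hypothetical stand-in for the one in the question, and LookupDemo, LinkedTables and the sample table names are made up for illustration. It assumes, as the question's code does, that columns match once their first character is dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the TableColumns class from the question.
public class TableColumns
{
    public string Table { get; set; }
    public string Column { get; set; }
}

public static class LookupDemo
{
    // Returns the distinct tables that share a column (ignoring the
    // one-character prefix) with the given table. Indexing a lookup is a
    // hash probe rather than a scan of the whole list, and a missing key
    // yields an empty sequence instead of throwing.
    public static List<string> LinkedTables(
        ILookup<string, TableColumns> byTable,
        ILookup<string, TableColumns> byShortColumn,
        string table)
    {
        return byTable[table]
            .SelectMany(tc => byShortColumn[tc.Column.Substring(1)])
            .Select(tc => tc.Table)
            .Where(t => t != table)
            .Distinct()
            .OrderBy(t => t, StringComparer.Ordinal)
            .ToList();
    }

    public static void Main()
    {
        var sourceList = new List<TableColumns>
        {
            new TableColumns { Table = "Orders",    Column = "oCustomerId" },
            new TableColumns { Table = "Orders",    Column = "oProductId"  },
            new TableColumns { Table = "Customers", Column = "cCustomerId" },
            new TableColumns { Table = "Products",  Column = "pProductId"  },
            new TableColumns { Table = "Regions",   Column = "rRegionId"   },
        };

        // Build both indexes once, before the search starts.
        var byTable = sourceList.ToLookup(tc => tc.Table);
        var byShortColumn = sourceList.ToLookup(tc => tc.Column.Substring(1));

        Console.WriteLine(string.Join(", ", LinkedTables(byTable, byShortColumn, "Orders")));
    }
}
```

With 50,000 rows, each step of the search then touches only the rows for one table or one short column name, instead of rescanning the full list for every column.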

Answer 2

I've managed to refactor your FindLinkingTables code down to this:

private void FindLinkingTables(
    List<TableColumns> sourceList, TableSearchNode parentNode,
    string targetTable, int maxSearchDepth)
{
    if (parentNode.Level < maxSearchDepth)
    {
        // Rows belonging to the table currently being expanded
        var sames = sourceList.Where(w => w.Table == parentNode.Table);

        // Join against the full list (not against sames), so that the
        // matches are columns in *other* tables
        var query =
            from x in sames
            join y in sourceList
                on x.Column.Substring(1) equals y.Column.Substring(1)
            where y.Table != parentNode.Table
                && !parentNode.Ancenstory.Contains(y.Table)
            select new
            {
                SourceColumn = x.Column,
                Node = new TableSearchNode
                {
                    Table = y.Table,
                    Column = y.Column,
                    Level = parentNode.Level + 1
                }
            };

        foreach (var z in query)
        {
            parentNode.AddChildNode(z.SourceColumn, z.Node);
            if (z.Node.Table != targetTable)
            {
                FindLinkingTables(sourceList, z.Node, targetTable, maxSearchDepth);
            }
            else
            {
                z.Node.NotifySeachResult(true);
            }
        }
    }
}

It appears to me that your logic in the where !parentNode.Ancenstory.Contains(y.Table) part of the query is flawed. I think you need to rethink your search operation here and see what you come up with.

Answer 3

There are a few things that stand out to me looking at this source method:

In your Where clause you make a call to parentNode.Ancenstory ; this has logarithmic run time by itself, then you make a call to .Contains on the List<string> it returns, which is another logarithmic call (it's linear, but the list has a logarithmic number of elements). What you're doing here is checking for cycles in your graph. These costs can be made constant by adding a field to TableColumns.Table which stores information on how that Table has been processed by the algorithm (alternatively, you could use a Dictionary<Table, int> , to avoid adding a field to the object). Typically, in a DFS algorithm, this field is either White, Grey, or Black - White for unprocessed (you haven't seen that Table before), Grey for an ancestor of the Table currently being processed, and Black for when you're done processing that Table and all of its children. To update your code to do this, it'd look like:
```
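// Assumes Color is tracked per table (for example backed by a shared
// Dictionary keyed by table name), so that x.Color and table.Color
// refer to the same state.
foreach (string sourceColumn in tableColumns)
{
    string shortName = sourceColumn.Substring(1);

    // Compare short names as strings; x.Column[0] is a char and would never equal shortName
    IEnumerable<TableSearchNode> tables =
        sourceList.Where(x => x.Column.Substring(1).Equals(shortName) && x.Color == White)
                  .Select(x => new TableSearchNode { Table = x.Table, Column = x.Column, Level = parentNode.Level + 1 });

    foreach (TableSearchNode table in tables)
    {
        parentNode.AddChildNode(sourceColumn, table);
        table.Color = Grey;   // on the current search path
        if (!table.Table.Equals(targetTable))
        {
            FindLinkingTables(sourceList, table, targetTable, maxSearchDepth);
        }
        else
        {
            table.NotifySeachResult(true);
        }
        table.Color = Black;  // this table and all of its children are done
    }
}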
```
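The Dictionary alternative mentioned above can be sketched on its own as follows. This is a minimal illustration, not a drop-in replacement: the Color enum, the ColorDfs class, CanReach, and the prebuilt graph dictionary (table name to linked table names) are all assumptions introduced for the example.

```csharp
using System;
using System.Collections.Generic;

public enum Color { White, Grey, Black }

public static class ColorDfs
{
    // graph maps a table name to the tables it shares a column with
    // (assumed to be built beforehand). Returns true if target can be
    // reached from current using at most maxDepth links.
    public static bool CanReach(
        Dictionary<string, List<string>> graph,
        Dictionary<string, Color> colors,
        string current, string target, int maxDepth)
    {
        if (current == target) return true;
        if (maxDepth == 0) return false;

        colors[current] = Color.Grey;            // on the current path
        bool found = false;
        List<string> neighbours;
        if (graph.TryGetValue(current, out neighbours))
        {
            foreach (string next in neighbours)
            {
                Color c;
                // A table with no entry counts as White (never seen).
                // Grey means "ancestor" (a cycle); Black means "already explored".
                if (colors.TryGetValue(next, out c) && c != Color.White) continue;

                if (CanReach(graph, colors, next, target, maxDepth - 1))
                {
                    found = true;
                    break;
                }
            }
        }
        colors[current] = Color.Black;           // fully explored
        return found;
    }

    public static void Main()
    {
        var graph = new Dictionary<string, List<string>>
        {
            { "A", new List<string> { "B" } },
            { "B", new List<string> { "C" } },
            { "C", new List<string> { "A" } },   // cycle back to A
        };
        Console.WriteLine(CanReach(graph, new Dictionary<string, Color>(), "A", "C", 5));
        Console.WriteLine(CanReach(graph, new Dictionary<string, Color>(), "A", "Z", 5));
    }
}
```

The colour check is a single hash lookup, so it replaces both the ancestry walk and the Contains call. One caveat: combined with a depth limit, a table that is first reached by a long route is coloured Black and is not explored again via a shorter route, so links close to the limit can be missed. A breadth-first search does not have that problem.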

As you mentioned above, you're running out of memory. The easiest fix for this is to remove the recursive call (which is acting as an implicit stack) and replace it with an explicit Stack data structure, removing the recursion. Additionally, this changes the recursion to a loop, which C# is better at optimizing.

private void FindLinkingTables(List<TableColumns> sourceList, TableSearchNode root, string targetTable, int maxSearchDepth)
{
    Stack<TableSearchNode> stack = new Stack<TableSearchNode>();
    stack.Push(root);

    while (stack.Count > 0)
    {
        TableSearchNode current = stack.Pop();

        // Enforce the depth limit per node; stack.Count is the number of
        // pending nodes, not the depth of the current one
        if (current.Level < maxSearchDepth)
        {
            var tableColumns = sourceList.Where(x => x.Table.Equals(current.Table))
                                         .Select(x => x.Column);

            foreach (string sourceColumn in tableColumns)
            {
                string shortName = sourceColumn.Substring(1);

                // Compare short names as strings; x.Column[0] is a char and would never equal shortName
                IEnumerable<TableSearchNode> tables =
                    sourceList.Where(x => x.Column.Substring(1).Equals(shortName) && x.Color == White)
                              .Select(x => new TableSearchNode { Table = x.Table, Column = x.Column, Level = current.Level + 1 });

                foreach (TableSearchNode table in tables)
                {
                    current.AddChildNode(sourceColumn, table);
                    if (!table.Table.Equals(targetTable))
                    {
                        table.Color = Grey;
                        stack.Push(table);
                    }
                    else
                    {
                        // you could go ahead and construct the ancestry list here using the stack
                        table.NotifySeachResult(true);
                        return;
                    }
                }
            }
        }

        current.Color = Black;
    }
}

Finally, we don't know how costly Table.Equals is, but if the comparison is deep then that could be adding significant run time to your inner loop.

Answer 4

Okay, here is an answer which basically abandons all the code you have posted.

First, you should take your List<TableColumns> and hash them into something that can be indexed without having to iterate over your entire list.

For this purpose, I have written a class called TableColumnIndexer :

class TableColumnIndexer
{
    Dictionary<string, HashSet<string>> tables = new Dictionary<string, HashSet<string>>();

    public void Add(string tableName, string columnName)
    {
        this.Add(new TableColumns { Table = tableName, Column = columnName });
    }

    public void Add(TableColumns tableColumns)
    {
        if(! tables.ContainsKey(tableColumns.Table))
        {
            tables.Add(tableColumns.Table, new HashSet<string>());
        }

        tables[tableColumns.Table].Add(tableColumns.Column);
    }

    // .... More code to follow

Now, once you inject all your Table / Column values into this indexing class, you can invoke a recursive method to retrieve the shortest ancestry link between two tables. The implementation here is somewhat sloppy, but it is written for clarity over performance at this point:

    // .... continuation of TableColumnIndexer class
    public List<string> GetShortestAncestry(string parentName, string targetName, int maxDepth)
    {
        return GetSortestAncestryR(parentName, targetName, maxDepth - 1, 0, new Dictionary<string,int>());
    }

    private List<string> GetSortestAncestryR(string currentName, string targetName, int maxDepth, int currentDepth, Dictionary<string, int> vistedTables)
    {
        // Check if we have visited this table before
        if (!vistedTables.ContainsKey(currentName))
            vistedTables.Add(currentName, currentDepth);

        // Make sure we have not visited this table at a shallower depth before
        if (vistedTables[currentName] < currentDepth)
            return null;
        else
            vistedTables[currentName] = currentDepth;
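
        // A name that was never added as a table has no outgoing links
        if (!tables.ContainsKey(currentName))
            return null;

        if (currentDepth <= maxDepth)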
        {
            List<string> result = new List<string>();

            // First check if the current table contains a reference to the target table
            if (tables[currentName].Contains(targetName))
            {
                result.Add(currentName);
                result.Add(targetName);
                return result;
            }
            // If not try to see if any of the children tables have the target table
            else
            {
                List<string> bestResult = null;
                int bestDepth = int.MaxValue;

                foreach (string childTable in tables[currentName])
                {
                    var tempResult = GetSortestAncestryR(childTable, targetName, maxDepth, currentDepth + 1, vistedTables);

                    // Keep only the shortest path found to the target table
                    if (tempResult != null && tempResult.Count < bestDepth)
                    {
                        bestDepth = tempResult.Count;
                        bestResult = tempResult;
                    }
                }

                // Take the best link we found and add it to the result list
                if (bestDepth < int.MaxValue && bestResult != null)
                {
                    result.Add(currentName);
                    result.AddRange(bestResult);
                    return result;
                }
                // If we did not find any result, return nothing
                else
                {
                    return null;
                }
            }
        }
        else
        {
            return null;
        }
    }
}

Now all this code is just a (somewhat verbose) implementation of a shortest path algorithm which allows for circular paths and multiple paths between the source table and target table. Note that if there are two routes with the same depth between two tables, the algorithm will select only one (and not necessarily predictably).
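For comparison, here is a sketch of the same shortest-link search written as a breadth-first search, a swapped-in technique rather than the recursive method above. BFS visits tables in order of distance, so the first time it reaches the target the path is already a shortest one. TableLinkSearch, ShortestLink and the sample data are hypothetical names for this example, and it assumes the question's rule that two columns match once their first character is dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TableLinkSearch
{
    // rows holds (table, column) pairs. Returns the shortest chain of tables
    // from start to target that uses at most maxDepth links, or null if
    // there is none.
    public static List<string> ShortestLink(
        IEnumerable<KeyValuePair<string, string>> rows,
        string start, string target, int maxDepth)
    {
        var rowList = rows.ToList();
        // Index once: table -> short column names, short column name -> tables.
        var columnsByTable = rowList.ToLookup(r => r.Key, r => r.Value.Substring(1));
        var tablesByColumn = rowList.ToLookup(r => r.Value.Substring(1), r => r.Key);

        // cameFrom doubles as the visited set; the start table maps to null.
        var cameFrom = new Dictionary<string, string> { { start, null } };
        var depth = new Dictionary<string, int> { { start, 0 } };
        var queue = new Queue<string>();
        queue.Enqueue(start);

        while (queue.Count > 0)
        {
            string current = queue.Dequeue();
            if (current == target)
            {
                // Walk the parent links back to the start, then reverse.
                var path = new List<string>();
                for (string t = current; t != null; t = cameFrom[t]) path.Add(t);
                path.Reverse();
                return path;
            }
            if (depth[current] == maxDepth) continue;   // do not expand past the limit

            foreach (string column in columnsByTable[current])
            {
                foreach (string next in tablesByColumn[column])
                {
                    if (cameFrom.ContainsKey(next)) continue;   // already reached
                    cameFrom[next] = current;
                    depth[next] = depth[current] + 1;
                    queue.Enqueue(next);
                }
            }
        }
        return null;
    }

    public static void Main()
    {
        var rows = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("Orders", "oCustomerId"),
            new KeyValuePair<string, string>("Orders", "oProductId"),
            new KeyValuePair<string, string>("Customers", "cCustomerId"),
            new KeyValuePair<string, string>("Customers", "cRegionId"),
            new KeyValuePair<string, string>("Regions", "rRegionId"),
            new KeyValuePair<string, string>("Products", "pProductId"),
        };
        var path = ShortestLink(rows, "Orders", "Regions", 2);
        Console.WriteLine(path == null ? "no link" : string.Join(" -> ", path));
    }
}
```

Because each table is enqueued at most once, memory stays proportional to the number of tables rather than to the number of paths, which is what made the original tree of TableSearchNode objects blow up. Like the recursive version, it returns only one of several equally short routes.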


solution1
4 2014-06-13 12:37:53

solution2
2 2014-06-13 13:08:11

solution3
2 2014-06-13 13:20:41

solution4
2 2014-06-13 13:48:30
