I am having trouble with the speed of the search function that I wrote. The function steps are described below:
The function's aim is to trace a link between tables, where the link is either direct or has multiple degrees of separation. The maximum level of recursion is a fixed integer value.
My problem is that any time I try to run this function for two levels of search depth (I wouldn't dare try deeper at this stage), the job either runs out of memory or I lose patience. On one run I waited 17 minutes before the job ran out of memory.
The average number of columns per table is 28 and the standard deviation is 34.
Here is a diagram showing examples of the various links that can be made between tables:
Here is my code:
private void FindLinkingTables(List<TableColumns> sourceList, TableSearchNode parentNode, string targetTable, int maxSearchDepth)
{
    if (parentNode.Level < maxSearchDepth)
    {
        IEnumerable<string> tableColumns = sourceList.Where(x => x.Table.Equals(parentNode.Table)).Select(x => x.Column);
        foreach (string sourceColumn in tableColumns)
        {
            string shortName = sourceColumn.Substring(1);
            IEnumerable<TableSearchNode> tables = sourceList.Where(
                x => x.Column.Substring(1).Equals(shortName) && !x.Table.Equals(parentNode.Table) && !parentNode.Ancenstory.Contains(x.Table)).Select(
                x => new TableSearchNode { Table = x.Table, Column = x.Column, Level = parentNode.Level + 1 });
            foreach (TableSearchNode table in tables)
            {
                parentNode.AddChildNode(sourceColumn, table);
                if (!table.Table.Equals(targetTable))
                {
                    FindLinkingTables(sourceList, table, targetTable, maxSearchDepth);
                }
                else
                {
                    table.NotifySeachResult(true);
                }
            }
        }
    }
}
EDIT: separated out TableSearchNode logic and added property and method for completeness
//TableSearchNode
public Dictionary<string, List<TableSearchNode>> Children { get; private set; }
//TableSearchNode
public List<string> Ancenstory
{
    get
    {
        Stack<string> ancestory = new Stack<string>();
        TableSearchNode ancestor = ParentNode;
        while (ancestor != null)
        {
            ancestory.Push(ancestor.tbl);
            ancestor = ancestor.ParentNode;
        }
        return ancestory.ToList();
    }
}
//TableSearchNode
public void AddChildNode(string referenceColumn, TableSearchNode childNode)
{
    childNode.ParentNode = this;
    List<TableSearchNode> relatedTables = null;
    Children.TryGetValue(referenceColumn, out relatedTables);
    if (relatedTables == null)
    {
        relatedTables = new List<TableSearchNode>();
        Children.Add(referenceColumn, relatedTables);
    }
    relatedTables.Add(childNode);
}
Thanks in advance for your help!
You are really wasting a lot of memory. What immediately comes to mind:
First of all, replace the incoming List<TableColumns> sourceList with an ILookup<string, TableColumns>. You should do this once, before calling FindLinkingTables:
ILookup<string, TableColumns> sourceLookup = sourceList.ToLookup(s => s.Table);
FindLinkingTables(sourceLookup, parentNode, targetTable, maxSearchDepth);
Do not call .ToList() if you do not really need it. For example, if you are only going to enumerate the resulting list once, you do not need it. So your main function will look like this:
private void FindLinkingTables(ILookup<string, TableColumns> sourceLookup, TableSearchNode parentNode, string targetTable, int maxSearchDepth)
{
    if (parentNode.Level < maxSearchDepth)
    {
        var tableColumns = sourceLookup[parentNode.Table].Select(x => x.Column);
        foreach (string sourceColumn in tableColumns)
        {
            string shortName = sourceColumn.Substring(1);
            var tables = sourceLookup
                .Where(group => !group.Key.Equals(parentNode.Table)
                    && !parentNode.Ancenstory.Contains(group.Key))
                .SelectMany(group => group)
                .Where(tableColumn => tableColumn.Column.Substring(1).Equals(shortName))
                .Select(x => new TableSearchNode { Table = x.Table, Column = x.Column, Level = parentNode.Level + 1 });
            foreach (TableSearchNode table in tables)
            {
                parentNode.AddChildNode(sourceColumn, table);
                if (!table.Table.Equals(targetTable))
                {
                    FindLinkingTables(sourceLookup, table, targetTable, maxSearchDepth);
                }
                else
                {
                    table.NotifySeachResult(true);
                }
            }
        }
    }
}
[Edit]
Also, in order to speed up the remaining complex LINQ query, you can prepare yet another ILookup, keyed on the short column name:
ILookup<string, TableColumns> sourceColumnLookup = sourceList
    .ToLookup(t => t.Column.Substring(1));
//...
private void FindLinkingTables(
    ILookup<string, TableColumns> sourceLookup,
    ILookup<string, TableColumns> sourceColumnLookup,
    TableSearchNode parentNode, string targetTable, int maxSearchDepth)
{
    if (parentNode.Level >= maxSearchDepth)
        return;
    var tableColumns = sourceLookup[parentNode.Table].Select(x => x.Column);
    foreach (string sourceColumn in tableColumns)
    {
        string shortName = sourceColumn.Substring(1);
        var tables = sourceColumnLookup[shortName]
            .Where(tableColumn => !tableColumn.Table.Equals(parentNode.Table)
                && !parentNode.AncenstoryReversed.Contains(tableColumn.Table))
            .Select(x => new TableSearchNode { Table = x.Table, Column = x.Column, Level = parentNode.Level + 1 });
        foreach (TableSearchNode table in tables)
        {
            parentNode.AddChildNode(sourceColumn, table);
            if (!table.Table.Equals(targetTable))
            {
                FindLinkingTables(sourceLookup, sourceColumnLookup, table, targetTable, maxSearchDepth);
            }
            else
            {
                table.NotifySeachResult(true);
            }
        }
    }
}
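To make the effect of the two lookups concrete, here is a minimal, self-contained sketch. It is not taken from the question: the TableColumns class below is a hypothetical stand-in with just the two properties used in this thread, and DirectlyLinkedTables is an illustrative helper name. It assumes, as the Substring(1) calls in the question suggest, that the first character of a column name is a per-table prefix.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the TableColumns type from the question.
public class TableColumns
{
    public string Table { get; set; }
    public string Column { get; set; }
}

public static class LookupSketch
{
    // Returns the names of the other tables that share a short column name
    // (the column name without its first character) with the given table.
    public static List<string> DirectlyLinkedTables(List<TableColumns> sourceList, string table)
    {
        // Build both indexes once. Each indexer access below is a hash probe
        // instead of a scan over the whole list.
        ILookup<string, TableColumns> byTable = sourceList.ToLookup(s => s.Table);
        ILookup<string, TableColumns> byShortName = sourceList.ToLookup(s => s.Column.Substring(1));

        // An ILookup indexer yields an empty sequence for a missing key,
        // so an unknown table simply produces an empty result.
        return byTable[table]
            .SelectMany(c => byShortName[c.Column.Substring(1)])
            .Where(c => c.Table != table)
            .Select(c => c.Table)
            .Distinct()
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToList();
    }
}
```

In real code you would build the two lookups once outside the method and pass them in, as in the answer above; they are built inside the helper here only to keep the sketch short.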
I've checked your Ancenstory property. If IEnumerable<string> is enough for your needs, check this implementation:
public IEnumerable<string> AncenstoryEnum
{
    get { return AncenstoryReversed.Reverse(); }
}
public IEnumerable<string> AncenstoryReversed
{
    get
    {
        TableSearchNode ancestor = ParentNode;
        while (ancestor != null)
        {
            yield return ancestor.tbl;
            ancestor = ancestor.ParentNode;
        }
    }
}
I've managed to refactor your FindLinkingTables code down to this:
private void FindLinkingTables(
    List<TableColumns> sourceList, TableSearchNode parentNode,
    string targetTable, int maxSearchDepth)
{
    if (parentNode.Level < maxSearchDepth)
    {
        var sames = sourceList.Where(w => w.Table == parentNode.Table);
        // Join the parent table's columns against the whole list on the short
        // column name, keeping only matches that belong to other tables.
        var query =
            from x in sames
            join y in sourceList
                on x.Column.Substring(1) equals y.Column.Substring(1)
            where y.Table != parentNode.Table
                && !parentNode.Ancenstory.Contains(y.Table)
            select new
            {
                SourceColumn = x.Column,
                Node = new TableSearchNode
                {
                    Table = y.Table,
                    Column = y.Column,
                    Level = parentNode.Level + 1
                }
            };
        foreach (var match in query)
        {
            TableSearchNode z = match.Node;
            parentNode.AddChildNode(match.SourceColumn, z);
            if (z.Table != targetTable)
            {
                FindLinkingTables(sourceList, z, targetTable, maxSearchDepth);
            }
            else
            {
                z.NotifySeachResult(true);
            }
        }
    }
}
It appears to me that your logic in the where !parentNode.Ancenstory.Contains(y.Table) part of the query is flawed. I think you need to rethink your search operation here and see what you come up with.
There are a few things that stand out to me looking at this source method:
In your Where clause you make a call to parentNode.Ancenstory; this walks the whole parent chain, so its run time is proportional to the node's depth (logarithmic in the size of the tree if the tree is reasonably balanced). You then call .Contains on the List<string> it returns, which is another scan of the same length (it's linear, but the list has a logarithmic number of elements). What you're doing here is checking for cycles in your graph. These costs can be made constant by adding a field (Color in the code below) to TableColumns which stores information on how that Table has been processed by the algorithm (alternatively, you could use a Dictionary<Table, int>, to avoid adding a field to the object). Typically, in a DFS algorithm, this field is either White, Grey, or Black: White for unprocessed (you haven't seen that Table before), Grey for an ancestor of the Table currently being processed, and Black for when you're done processing that Table and all of its children. To update your code to do this, it'd look like:
foreach (string sourceColumn in tableColumns)
{
    string shortName = sourceColumn.Substring(1);
    // Match on the short name (column name without its first character).
    // Color has to be tracked per table name and shared between TableColumns
    // and TableSearchNode (for example through one dictionary) for this check to work.
    IEnumerable<TableSearchNode> tables =
        sourceList.Where(x => x.Column.Substring(1).Equals(shortName) && x.Color == White)
                  .Select(x => new TableSearchNode { Table = x.Table, Column = x.Column, Level = parentNode.Level + 1 });
    foreach (TableSearchNode table in tables)
    {
        parentNode.AddChildNode(sourceColumn, table);
        table.Color = Grey;
        if (!table.Table.Equals(targetTable))
        {
            FindLinkingTables(sourceList, table, targetTable, maxSearchDepth);
        }
        else
        {
            table.NotifySeachResult(true);
        }
        table.Color = Black;
    }
}
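As a rough, self-contained illustration of the dictionary variant mentioned above, the sketch below keeps the colours in a Dictionary<string, Color> keyed by table name. Everything in it is an assumption made for the example: the adjacency-list input (table name to directly linked table names), the Color enum, and the CanReach and Visit names are not part of the question's code. Only Grey tables are skipped, so the check behaves like the original ancestry test but in constant time. Skipping Black tables as well would prune more of the search, but combined with a depth limit it can miss a link that is only short enough via a different route.

```csharp
using System;
using System.Collections.Generic;

public enum Color { White, Grey, Black }

public static class ColorDfsSketch
{
    // graph: table name -> names of directly linked tables (hypothetical input shape).
    // Returns true if target can be reached from start using at most maxDepth links.
    public static bool CanReach(Dictionary<string, List<string>> graph, string start, string target, int maxDepth)
    {
        // Tables without an entry in this dictionary are treated as White.
        var colors = new Dictionary<string, Color>();
        return Visit(graph, colors, start, target, 0, maxDepth);
    }

    private static bool Visit(Dictionary<string, List<string>> graph, Dictionary<string, Color> colors,
                              string current, string target, int depth, int maxDepth)
    {
        if (current == target) return true;
        if (depth >= maxDepth) return false;

        colors[current] = Color.Grey; // current is now on the active path
        bool found = false;
        if (graph.TryGetValue(current, out List<string> neighbours))
        {
            foreach (string next in neighbours)
            {
                // Constant-time replacement for Ancenstory.Contains(next):
                // a Grey table is an ancestor of the table being processed.
                if (colors.TryGetValue(next, out Color c) && c == Color.Grey) continue;
                if (Visit(graph, colors, next, target, depth + 1, maxDepth))
                {
                    found = true;
                    break;
                }
            }
        }
        colors[current] = Color.Black; // finished, no longer on the active path
        return found;
    }
}
```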
As you mentioned above, you're running out of memory. The easiest fix for this is to replace the recursive call (which is acting as an implicit stack) with an explicit Stack data structure. Additionally, this changes the recursion to a loop, which C# is better at optimizing.
private void FindLinkingTables(List<TableColumns> sourceList, TableSearchNode root, string targetTable, int maxSearchDepth)
{
    Stack<TableSearchNode> stack = new Stack<TableSearchNode>();
    stack.Push(root);
    while (stack.Count > 0)
    {
        TableSearchNode current = stack.Pop();
        // Level, not stack.Count, is the depth of the node being expanded.
        if (current.Level < maxSearchDepth)
        {
            var tableColumns = sourceList.Where(x => x.Table.Equals(current.Table))
                                         .Select(x => x.Column);
            foreach (string sourceColumn in tableColumns)
            {
                string shortName = sourceColumn.Substring(1);
                // Match on the short name (column name without its first character).
                IEnumerable<TableSearchNode> tables =
                    sourceList.Where(x => x.Column.Substring(1).Equals(shortName) && x.Color == White)
                              .Select(x => new TableSearchNode { Table = x.Table, Column = x.Column, Level = current.Level + 1 });
                foreach (TableSearchNode table in tables)
                {
                    current.AddChildNode(sourceColumn, table);
                    if (!table.Table.Equals(targetTable))
                    {
                        table.Color = Grey;
                        stack.Push(table);
                    }
                    else
                    {
                        // you could go ahead and construct the ancestry list here using the stack
                        table.NotifySeachResult(true);
                        return;
                    }
                }
            }
        }
        current.Color = Black;
    }
}
Finally, we don't know how costly Table.Equals is, but if the comparison is deep then that could be adding significant run time to your inner loop.
Okay, here is an answer which basically abandons all the code you have posted.
First, you should take your List<TableColumns> and hash its entries into something that can be indexed without having to iterate over your entire list.
For this purpose, I have written a class called TableColumnIndexer:
class TableColumnIndexer
{
    Dictionary<string, HashSet<string>> tables = new Dictionary<string, HashSet<string>>();
    public void Add(string tableName, string columnName)
    {
        this.Add(new TableColumns { Table = tableName, Column = columnName });
    }
    public void Add(TableColumns tableColumns)
    {
        if (!tables.ContainsKey(tableColumns.Table))
        {
            tables.Add(tableColumns.Table, new HashSet<string>());
        }
        tables[tableColumns.Table].Add(tableColumns.Column);
    }
    // .... More code to follow
Now, once you inject all your Table / Column values into this indexing class, you can invoke a recursive method to retrieve the shortest ancestry link between two tables. The implementation here is somewhat sloppy, but it is written for clarity over performance at this point:
    // .... continuation of TableColumnIndexer class
    public List<string> GetShortestAncestry(string parentName, string targetName, int maxDepth)
    {
        return GetShortestAncestryR(parentName, targetName, maxDepth - 1, 0, new Dictionary<string, int>());
    }
    private List<string> GetShortestAncestryR(string currentName, string targetName, int maxDepth, int currentDepth, Dictionary<string, int> visitedTables)
    {
        // Check if we have visited this table before
        if (!visitedTables.ContainsKey(currentName))
            visitedTables.Add(currentName, currentDepth);
        // Make sure we have not visited this table at a shallower depth before
        if (visitedTables[currentName] < currentDepth)
            return null;
        else
            visitedTables[currentName] = currentDepth;
        // A name that was never added to the index has no links to follow
        if (!tables.ContainsKey(currentName))
            return null;
        if (currentDepth <= maxDepth)
        {
            List<string> result = new List<string>();
            // First check if the current table contains a reference to the target table
            if (tables[currentName].Contains(targetName))
            {
                result.Add(currentName);
                result.Add(targetName);
                return result;
            }
            // If not, try to see if any of the children tables have the target table
            else
            {
                List<string> bestResult = null;
                int bestDepth = int.MaxValue;
                foreach (string childTable in tables[currentName])
                {
                    var tempResult = GetShortestAncestryR(childTable, targetName, maxDepth, currentDepth + 1, visitedTables);
                    // Keep only the shortest path found to the target table
                    if (tempResult != null && tempResult.Count < bestDepth)
                    {
                        bestDepth = tempResult.Count;
                        bestResult = tempResult;
                    }
                }
                // Take the best link we found and add it to the result list
                if (bestDepth < int.MaxValue && bestResult != null)
                {
                    result.Add(currentName);
                    result.AddRange(bestResult);
                    return result;
                }
                // If we did not find any result, return nothing
                else
                {
                    return null;
                }
            }
        }
        else
        {
            return null;
        }
    }
}
Now all this code is just a (somewhat verbose) implementation of a shortest path algorithm which allows for circular paths and multiple paths between the source table and target table. Note that if there are two routes with the same depth between two tables, the algorithm will select only one (and not necessarily predictably).
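For comparison, a breadth-first search is a common alternative way to get a shortest link, because it reaches every table for the first time at its minimum depth. The sketch below is not the recursive method above but a swapped-in technique under stated assumptions: the input is a hypothetical adjacency list (table name to directly linked table names), and BfsSketch and ShortestLink are names made up for the example.

```csharp
using System;
using System.Collections.Generic;

public static class BfsSketch
{
    // links: table name -> names of directly linked tables (hypothetical input shape).
    // Returns the shortest chain of tables from source to target that uses
    // at most maxDepth links, or null if there is none.
    public static List<string> ShortestLink(Dictionary<string, List<string>> links, string source, string target, int maxDepth)
    {
        // parent doubles as the visited set; the source has no parent.
        var parent = new Dictionary<string, string> { [source] = null };
        var depth = new Dictionary<string, int> { [source] = 0 };
        var queue = new Queue<string>();
        queue.Enqueue(source);

        while (queue.Count > 0)
        {
            string current = queue.Dequeue();
            if (current == target)
            {
                // Walk the parent links back to the source, then reverse.
                var path = new List<string>();
                for (string t = current; t != null; t = parent[t]) path.Add(t);
                path.Reverse();
                return path;
            }
            if (depth[current] >= maxDepth) continue; // do not expand past the limit
            if (!links.TryGetValue(current, out List<string> neighbours)) continue;
            foreach (string next in neighbours)
            {
                if (parent.ContainsKey(next)) continue; // already reached at the same or a shallower depth
                parent[next] = current;
                depth[next] = depth[current] + 1;
                queue.Enqueue(next);
            }
        }
        return null;
    }
}
```

Like the recursive version, this returns only one of several equally short routes. Here the choice is determined by the order of the neighbour lists.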