
Optimizing list performance in C#

I am working on a project (in .NET 3.5) that reads in 2 files, then compares them and finds the missing objects.

Based on this data, I need to parse it further and locate each object's location. I'll try explaining this further:

I have 2 lists. List 1 is a very long list of all files on a server, along with their physical address on that server or another server; this file is a little over 1 billion lines long and continuously growing (a little ridiculous, I know). File size is around 160MB currently. The other list is a report list that shows missing files on the server. This list is minuscule compared to list 1, and is usually under 1MB in size.

I have to intersect list 2 with list 1 and determine where the missing objects are located. The items in the list look like this (unfortunately it is space separated and not a CSV document): filename.extension rev rev# source server:harddriveLocation\\|filenameOnServer.extension origin

Using a stream, I read both files into separate string lists. I then take a regex and parse items from list 2 into a third list that contains the filename.extension, rev, and rev#. All this works fantastically; it's the performance that is killing me.

I am hoping there is a much more efficient way to do what I am doing.

int i = 0; // index of the current line; the rev and rev# live on the next two lines
foreach (String item in slMissingObjectReport)
{
    if (item.Contains(".ext1") || item.Contains(".ext2") || item.Contains(".ext3"))
    {
        if (!item.Contains("|"))
        {
            slMissingObjects.Add(item + "," + slMissingObjectReport[i + 1] + "," + slMissingObjectReport[i + 2]); //object, rev, version
        }
    }

    i++;
}

Stopwatch taskTime = Stopwatch.StartNew(); // overall timer (declared earlier in the original code)
int j = 1; //debug only

foreach (String item in slMissingObjects)
{
    Stopwatch matchTime = Stopwatch.StartNew(); //used for debugging

    // scan the full list for any record containing this item's filename key
    foreach (String items in slAllObjects.Where(s => s.Contains(item.Remove(item.IndexOf(',')))))
    {
        slFoundInAllObjects.Add(item);
    }

    matchTime.Stop();

    tsStatus.Text = "Missing Object Count: " + slMissingObjects.Count + " | " + "All Objects count: " + slAllObjects.Count + " | Time elapsed: " + (taskTime.ElapsedMilliseconds) * 0.001 + "s | Items left: " + (slMissingObjects.Count - j).ToString();

    j++;
}

taskTime.Stop();
lstStatus.Items.Add(("Time to complete all tasks: " + (taskTime.ElapsedMilliseconds) * 0.001) + "s");

This works, but since there are currently 1300 items in my missing objects list, it takes an average of 8 to 12 minutes to complete. The part that takes the longest is:

foreach (String items in slAllObjects.Where(s => s.Contains(item.Remove(item.IndexOf(',')))))
{
    slFoundInAllObjects.Add(item);
}

I just need a pointer in the right direction, along with maybe a hand on how I can improve this code. The LINQ isn't the killer it seems; it's adding the results to a list that seems to kill the performance.

HashSets are designed specifically for this kind of task, where you have unique values and you need to compare them.

Lists are not. They are just arbitrary collections.

My first port of call would be to use a HashSet<T> and the various set-intersection methods that come free with it.
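As a rough sketch of that idea (a Dictionary<string, string> keyed by filename is shown here so the matching record comes back too; a plain HashSet<string> works the same way if you only need membership). It assumes the filename.extension token at the start of each record in slAllObjects matches the text before the first comma in slMissingObjects, which is an assumption about your data, not something the question confirms:

// Build the lookup once: O(m). Assumes the first space-separated token
// of each record is the filename.extension join key.
var allObjectsByKey = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
foreach (string record in slAllObjects)
{
    int space = record.IndexOf(' ');
    string key = space < 0 ? record : record.Substring(0, space);
    allObjectsByKey[key] = record; // last record wins if a key repeats
}

// Each probe is now O(1) instead of an O(m) Contains() scan.
foreach (string item in slMissingObjects)
{
    string key = item.Remove(item.IndexOf(','));
    string record;
    if (allObjectsByKey.TryGetValue(key, out record))
    {
        slFoundInAllObjects.Add(record);
    }
}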

One improvement you can make would be to use AddRange instead of Add. AddRange lets the internal list preallocate the memory it needs for the add, instead of growing multiple times over the course of your foreach loop.

IEnumerable<string> items = slAllObjects.Where(s => s.Contains(item.Remove(item.IndexOf(','))));
slFoundInAllObjects.AddRange(items);

Secondly, you should probably avoid item.Remove(item.IndexOf(',')) in your Where lambda, as this causes it to be executed once for every item in the list. That value is static, so you can compute it once ahead of time:

var itemWithoutComma = item.Remove(item.IndexOf(','));
IEnumerable<string> items = slAllObjects.Where(s => s.Contains(itemWithoutComma));
slFoundInAllObjects.AddRange(items);

There seem to be a few bottlenecks, which have been pointed out.

If I understand correctly, you are:

  1. Reading two files into 2 lists. O(K)
  2. Iterating over one list (O(n)) and searching for a matching item in the other list (O(m)).
  3. Creating a new list containing these matches. O(n)

So you have something on the order of O(K + m * n * n). The bottlenecks happen in steps 2 and 3 (the inner loop in your code).

Solution:

  1. The collection you are searching through (slAllObjects, I think) should be something you can search quickly, so either use a hash set, or sort it once and use a binary search to find items in this collection.
  2. Preallocate the list you are creating. You know the size in advance, so set the Capacity to match.

This solution should reduce O(n^2) * O(m) to O(n) * O(k) if you use a hash set, or to O(n) * log(m) if you sort the list.
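If you prefer the sorted route, here is a minimal sketch under the same filename-key assumption as above: sort the keys once up front, then probe with BinarySearch.

// Extract and sort the keys once: O(m log m).
List<string> allKeys = slAllObjects
    .Select(s => { int sp = s.IndexOf(' '); return sp < 0 ? s : s.Substring(0, sp); })
    .ToList();
allKeys.Sort(StringComparer.Ordinal);

// Each lookup is now O(log m).
foreach (string item in slMissingObjects)
{
    string key = item.Remove(item.IndexOf(','));
    if (allKeys.BinarySearch(key, StringComparer.Ordinal) >= 0)
    {
        slFoundInAllObjects.Add(item);
    }
}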

First stop: don't use a List. Use HashSets for quicker inserts and comparisons.

Next up, determine whether the lists are in a pre-sorted order. If they are, you can read both files at the same time, do only a single pass through each, and never have to keep them in memory at all.
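A hypothetical sketch of that single-pass merge, assuming both files are already sorted by the same key, with placeholder file names standing in for the real paths:

using System;
using System.IO;

static class SortedMerge
{
    // First space-separated token used as the join key (an assumption
    // about the record format).
    static string KeyOf(string line)
    {
        int sp = line.IndexOf(' ');
        return sp < 0 ? line : line.Substring(0, sp);
    }

    public static void Run()
    {
        // "missing.txt" and "all.txt" are placeholder paths.
        using (StreamReader missing = new StreamReader("missing.txt"))
        using (StreamReader all = new StreamReader("all.txt"))
        {
            string m = missing.ReadLine();
            string a = all.ReadLine();
            while (m != null && a != null)
            {
                int cmp = string.CompareOrdinal(KeyOf(m), KeyOf(a));
                if (cmp == 0)
                {
                    Console.WriteLine("found: " + a); // match in the full list
                    m = missing.ReadLine();
                    a = all.ReadLine();
                }
                else if (cmp < 0)
                {
                    m = missing.ReadLine(); // this missing entry has no match
                }
                else
                {
                    a = all.ReadLine();
                }
            }
        }
    }
}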

If all else fails, look into using LINQ's Intersect method, which will likely perform much better than your home-grown version of it.
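For example (assuming allKeys and missingKeys already hold the extracted filename keys from each list):

// Intersect builds a hash set from one sequence internally, so this is
// effectively the HashSet approach in a single call.
IEnumerable<string> common = allKeys.Intersect(missingKeys, StringComparer.OrdinalIgnoreCase);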

In addition to what has already been suggested, I would consider using trees. If I understood correctly, there is some sort of hierarchy (i.e. server, file path, file name, etc.) in the file names, right? By using a tree you greatly reduce the search space at each step.

Also, if you use a Dictionary<String, Node> in each node, you can reduce the search time, which becomes O(1) given a constant number of hierarchy levels.
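A hypothetical node type for that idea; splitting each path into its hierarchy parts and walking (or building) the tree costs one dictionary probe per level:

class Node
{
    public readonly Dictionary<string, Node> Children =
        new Dictionary<string, Node>(StringComparer.OrdinalIgnoreCase);
    public string Record; // full record stored at a leaf, null elsewhere

    public Node GetOrAdd(string part)
    {
        Node child;
        if (!Children.TryGetValue(part, out child))
        {
            child = new Node();
            Children[part] = child;
        }
        return child;
    }
}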

Also, if you decide to use arrays or array lists, avoid foreach and use for, as it should be faster (no iterator is used, so for array lists at least it should be faster).

Let me know if anything is unclear.
