![](/img/trans.png)
[英]With two very large lists/collections - how to detect and/or remove duplicates efficiently
[英]How can I efficiently remove elements by index from a very large list?
我有一个非常大的整数列表(大约 20 亿个元素)和一个索引列表(几千个元素),我需要从第一个列表中删除元素。 我当前的方法是遍历第二个列表中的所有索引,将每个索引传递给第一个列表的RemoveAt()
方法:
indices.Sort();
indices.Reverse();
for (i = 0; i < indices.Count; i++)
{
largeList.RemoveAt(indices[i]);
}
但是,大约需要 2 分钟才能完成。 我真的需要更快地执行此操作。 有什么办法可以优化这个吗?
我有一个 10 核的 Intel i9X CPU,所以也许是某种并行处理方式?
每次调用RemoveAt()
时,它都必须将指定索引之后的每个元素向上移动一个元素。 如果在一个非常大的列表中有数千个元素要删除,那将导致很多很多(不必要的)移动。
我的想法是,如果您可以计算出每个移动的起始索引和长度,则可以一次处理该列表,而不会出现重叠移动。 这是通过比较要删除的索引列表中的相邻元素来完成的。 虽然这确实意味着构建要执行的移动操作的第三个List<>
,但我希望提前计划出最有效、最少的移动策略最终会得到回报(或者也许有一种方法可以在不分配任何对象的情况下做到这一点目前我还没有想到)。
您可以在下面的基准代码中看到我的实现RemoveUsingMoveStrategy()
。 在test
模式下运行下面的启动器程序,我们可以看到它——以及其他答案——在给定索引0
、 1
、 5
、 10
、 11
、 15
、 15
(重复)、 18
和19
时产生正确的结果从20
元素List<int>
...
PS> dotnet run --configuration release --framework netcoreapp3.1 -- test [snip] RemoveUsingMoveStrategy(): Elements: { 2, 3, 4, 6, 7, 8, 9, 12, 13, 14, 16, 17 } Count: 12 Sequence: Equal to control list Duration: 3.95 milliseconds Returned: Input list [snip]
我打算以十亿个元素为基准List<int>
,但是RemoveUsingRemoveAt()
- 受问题中的代码启发 - 效率太低以至于花费的时间太长,所以我只增加了 1000 万个元素。
我仍然希望运行一个不包括RemoveUsingRemoveAt()
的十亿元素基准测试。 出于这个原因,我引入了一个名为RemoveUsingListCopy()
的不太天真的实现,作为比较所有列表大小的基线。 顾名思义,它不会修改输入列表,而是创建一个应用了删除的新列表。
由于并非所有实现都要求要删除的索引列表是唯一的和/或排序的,因此我认为最公平的基准测试方法是在方法需要的基准时间中包括该开销。
我对其他答案中的代码所做的唯一其他更改是使用基准列表( DataList
和RemovalList
)并将var
更改为显式类型以使读者清楚。
RemovalListLocation
指示从DataList
中的何处删除索引。
Beginning
、 Middle
和End
,它是从该位置删除的连续RemovalListSize
大小的块。Random
,它是从常量种子生成的随机、有效、未排序、不保证唯一的索引RemovalListSize
。 为了使结果更短(呃),我选择只对Middle
值进行基准测试——认为这是一个很好的折中方案——和Random
值。
RemoveUsingRemoveAt()
很糟糕。 不要那样做。
对于较小的列表, RemoveUsingListCopy()
始终是最快的,但代价是 memory 的使用量加倍。
对于更大的列表, Vernou 的答案至少是其他答案的 5 倍,但代价是依赖于实现细节。
这只是表明您不必总是偏爱List<>
而不是数组——除非您需要它的附加功能——因为它并不总是优越的。 它隔离底层数组,防止您(没有反射)使用更高效的访问方法,如Array.Copy()
和unsafe
的代码。
Theodor Zoulias 的回答对较小的列表效果很好,但我认为必须“接触”所有 1000 万个索引——每个索引都调用HashSet<>.Contains()
并增加一个索引变量——确实对较大的列表造成了伤害。
其余的实现(包括我的)具有大致相同的性能:不是最好的,但仍然相当不错。
在benchmark
模式下运行本答案稍后定义的启动器程序,我从BenchmarkDotNet
获得这些结果......
PS> dotnet run --configuration release --framework netcoreapp3.1 -- benchmark [snip] // * Summary * BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.450 (2004/?/20H1) Intel Core i7 CPU 860 2.80GHz (Nehalem), 1 CPU, 8 logical and 4 physical cores .NET Core SDK=3.1.401 [Host] : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT MediumRun : .NET Framework 4.8 (4.8.4200.0), X64 RyuJIT Job=MediumRun InvocationCount=1 IterationCount=15 LaunchCount=2 UnrollFactor=1 WarmupCount=10 | Method | Runtime | DataListSize | RemovalListSize | RemovalListLocation | Mean | Error | StdDev | Median | Ratio | RatioSD | |------------------------------ |-------------- |------------- |---------------- |-------------------- |---------------:|--------------:|--------------:|---------------:|------:|--------:| | RemoveUsingListCopy | .NET 4.8 | 10000 | 2000 | Random | 379.9 μs | 15.93 μs | 23.36 μs | 380.4 μs | 1.00 | 0.00 | | RemoveUsingRemoveAt | .NET 4.8 | 10000 | 2000 | Random | 1,463.7 μs | 93.79 μs | 137.47 μs | 1,434.7 μs | 3.87 | 0.41 | | RemoveUsingMoveStrategy | .NET 4.8 | 10000 | 2000 | Random | 635.9 μs | 19.77 μs | 29.60 μs | 624.0 μs | 1.67 | 0.14 | | Answer63496225_TheodorZoulias | .NET 4.8 | 10000 | 2000 | Random | 372.1 μs | 11.52 μs | 16.15 μs | 373.8 μs | 0.99 | 0.07 | | Answer63496768_CoolBots | .NET 4.8 | 10000 | 2000 | Random | 594.5 μs | 25.13 μs | 36.03 μs | 593.1 μs | 1.57 | 0.12 | | Answer63495657_NetMage | .NET 4.8 | 10000 | 2000 | Random | 618.8 μs | 17.53 μs | 23.99 μs | 622.4 μs | 1.65 | 0.13 | | Answer63496256_Vernou | .NET 4.8 | 10000 | 2000 | Random | 645.6 μs | 27.28 μs | 39.99 μs | 632.2 μs | 1.71 | 0.16 | | | | | | | | | | | | | | RemoveUsingListCopy | .NET Core 3.1 | 10000 | 2000 | Random | 391.5 μs | 10.39 μs | 15.55 μs | 391.6 μs | 1.00 | 0.00 | | RemoveUsingRemoveAt | .NET Core 3.1 | 10000 | 2000 | Random | 1,402.2 μs | 44.20 μs | 64.80 μs | 1,407.6 μs | 3.59 | 0.21 | | RemoveUsingMoveStrategy | .NET Core 3.1 | 10000 | 2000 | Random | 557.9 μs | 19.73 μs | 27.00 μs | 557.2 μs | 1.43 | 0.10 | | Answer63496225_TheodorZoulias | .NET Core 3.1 | 10000 | 2000 | Random | 424.3 μs | 20.90 μs | 29.30 μs | 424.2 μs | 1.09 | 0.09 | | Answer63496768_CoolBots | .NET Core 3.1 | 10000 | 2000 | Random | 535.0 μs | 19.37 μs | 27.16 μs | 537.1 μs | 1.37 | 0.08 | | Answer63495657_NetMage | .NET Core 3.1 | 10000 | 2000 | Random | 557.7 μs | 18.73 μs | 25.63 μs | 550.0 μs | 1.43 | 0.09 | | Answer63496256_Vernou | .NET Core 3.1 | 10000 | 2000 | Random | 554.2 μs | 13.82 μs | 18.45 μs | 554.0 μs | 1.42 | 0.07 | | | | | | | | | | | | | | RemoveUsingListCopy | .NET 4.8 | 10000 | 2000 | Middle | 221.6 μs | 7.25 μs | 10.63 μs | 222.5 μs | 1.00 | 0.00 | | RemoveUsingRemoveAt | .NET 4.8 | 10000 | 2000 | Middle | 1,195.3 μs | 20.01 μs | 28.69 μs | 1,187.7 μs | 5.42 | 0.30 | | RemoveUsingMoveStrategy | .NET 4.8 | 10000 | 2000 | Middle | 405.0 μs | 13.65 μs | 19.14 μs | 410.7 μs | 1.83 | 0.10 | | Answer63496225_TheodorZoulias | .NET 4.8 | 10000 | 2000 | Middle | 206.3 μs | 8.62 μs | 12.09 μs | 204.9 μs | 0.94 | 0.08 | | Answer63496768_CoolBots | .NET 4.8 | 10000 | 2000 | Middle | 427.5 μs | 15.56 μs | 22.81 μs | 435.4 μs | 1.93 | 0.13 | | Answer63495657_NetMage | .NET 4.8 | 10000 | 2000 | Middle | 405.4 μs | 13.80 μs | 19.35 μs | 403.8 μs | 1.84 | 0.11 | | Answer63496256_Vernou | .NET 4.8 | 10000 | 2000 | Middle | 413.9 μs | 15.26 μs | 20.89 μs | 419.8 μs | 1.87 | 0.12 | | | | | | | | | | | | | | RemoveUsingListCopy | .NET Core 3.1 | 10000 | 2000 | Middle | 235.2 μs | 10.73 μs | 15.73 μs | 236.2 μs | 1.00 | 0.00 | | RemoveUsingRemoveAt | .NET Core 3.1 | 10000 | 2000 | Middle | 1,345.6 μs | 32.07 μs | 43.90 μs | 1,352.7 μs | 5.77 | 0.41 | | RemoveUsingMoveStrategy | .NET Core 3.1 | 10000 | 2000 | Middle | 324.0 μs | 4.92 μs | 7.05 μs | 326.6 μs | 1.39 | 0.09 | | Answer63496225_TheodorZoulias | .NET Core 3.1 | 10000 | 2000 | Middle | 262.9 μs | 6.18 μs | 9.06 μs | 265.4 μs | 1.12 | 0.08 | | Answer63496768_CoolBots | .NET Core 3.1 | 10000 | 2000 | Middle | 333.6 μs | 10.14 μs | 13.87 μs | 340.9 μs | 1.43 | 0.11 | | Answer63495657_NetMage | .NET Core 3.1 | 10000 | 2000 | Middle | 313.5 μs | 9.05 μs | 12.69 μs | 310.5 μs | 1.34 | 0.11 | | Answer63496256_Vernou | .NET Core 3.1 | 10000 | 2000 | Middle | 332.3 μs | 6.70 μs | 8.95 μs | 331.9 μs | 1.43 | 0.09 | | | | | | | | | | | | | | RemoveUsingListCopy | .NET 4.8 | 10000000 | 2000 | Random | 253,977.1 μs | 2,721.70 μs | 3,989.43 μs | 253,809.0 μs | 1.00 | 0.00 | | RemoveUsingRemoveAt | .NET 4.8 | 10000000 | 2000 | Random | 5,191,083.4 μs | 13,200.66 μs | 18,931.99 μs | 5,187,162.3 μs | 20.43 | 0.34 | | RemoveUsingMoveStrategy | .NET 4.8 | 10000000 | 2000 | Random | 65,365.4 μs | 422.41 μs | 592.16 μs | 65,307.3 μs | 0.26 | 0.00 | | Answer63496225_TheodorZoulias | .NET 4.8 | 10000000 | 2000 | Random | 240,584.4 μs | 3,687.89 μs | 5,048.03 μs | 244,336.1 μs | 0.95 | 0.02 | | Answer63496768_CoolBots | .NET 4.8 | 10000000 | 2000 | Random | 54,168.4 μs | 1,001.37 μs | 1,436.14 μs | 53,390.3 μs | 0.21 | 0.01 | | Answer63495657_NetMage | .NET 4.8 | 10000000 | 2000 | Random | 72,501.4 μs | 452.46 μs | 634.29 μs | 72,161.2 μs | 0.29 | 0.00 | | Answer63496256_Vernou | .NET 4.8 | 10000000 | 2000 | Random | 5,814.0 μs | 89.71 μs | 128.67 μs | 5,825.3 μs | 0.02 | 0.00 | | | | | | | | | | | | | | RemoveUsingListCopy | .NET Core 3.1 | 10000000 | 2000 | Random | 239,784.0 μs | 2,721.35 μs | 3,902.88 μs | 241,125.5 μs | 1.00 | 0.00 | | RemoveUsingRemoveAt | .NET Core 3.1 | 10000000 | 2000 | Random | 5,538,337.5 μs | 353,505.30 μs | 495,565.06 μs | 5,208,226.1 μs | 23.12 | 2.15 | | RemoveUsingMoveStrategy | .NET Core 3.1 | 10000000 | 2000 | Random | 33,071.8 μs | 103.80 μs | 138.57 μs | 33,030.5 μs | 0.14 | 0.00 | | Answer63496225_TheodorZoulias | .NET Core 3.1 | 10000000 | 2000 | Random | 240,825.5 μs | 851.49 μs | 1,248.11 μs | 240,520.9 μs | 1.00 | 0.02 | | Answer63496768_CoolBots | .NET Core 3.1 | 10000000 | 2000 | Random | 26,265.0 μs | 90.76 μs | 124.23 μs | 26,253.0 μs | 0.11 | 0.00 | | Answer63495657_NetMage | .NET Core 3.1 | 10000000 | 2000 | Random | 48,670.6 μs | 581.51 μs | 833.99 μs | 48,303.0 μs | 0.20 | 0.00 | | Answer63496256_Vernou | .NET Core 3.1 | 10000000 | 2000 | Random | 5,905.5 μs | 96.27 μs | 131.78 μs | 5,915.1 μs | 0.02 | 0.00 | | | | | | | | | | | | | | RemoveUsingListCopy | .NET 4.8 | 10000000 | 2000 | Middle | 153,776.2 μs | 2,454.90 μs | 3,674.38 μs | 152,872.0 μs | 1.00 | 0.00 | | RemoveUsingRemoveAt | .NET 4.8 | 10000000 | 2000 | Middle | 5,245,952.0 μs | 13,845.58 μs | 20,294.67 μs | 5,252,922.4 μs | 34.10 | 0.81 | | RemoveUsingMoveStrategy | .NET 4.8 | 10000000 | 2000 | Middle | 33,233.6 μs | 110.33 μs | 158.24 μs | 33,217.3 μs | 0.22 | 0.01 | | Answer63496225_TheodorZoulias | .NET 4.8 | 10000000 | 2000 | Middle | 128,949.8 μs | 560.72 μs | 804.17 μs | 128,724.9 μs | 0.84 | 0.02 | | Answer63496768_CoolBots | .NET 4.8 | 10000000 | 2000 | Middle | 48,965.1 μs | 70.75 μs | 94.45 μs | 48,957.3 μs | 0.32 | 0.01 | | Answer63495657_NetMage | .NET 4.8 | 10000000 | 2000 | Middle | 32,641.5 μs | 66.85 μs | 91.51 μs | 32,610.0 μs | 0.21 | 0.01 | | Answer63496256_Vernou | .NET 4.8 | 10000000 | 2000 | Middle | 2,982.2 μs | 29.47 μs | 41.31 μs | 2,961.9 μs | 0.02 | 0.00 | | | | | | | | | | | | | | RemoveUsingListCopy | .NET Core 3.1 | 10000000 | 2000 | Middle | 144,208.7 μs | 2,035.88 μs | 2,984.16 μs | 142,693.2 μs | 1.00 | 0.00 | | RemoveUsingRemoveAt | .NET Core 3.1 | 10000000 | 2000 | Middle | 5,235,957.7 μs | 13,674.19 μs | 20,043.46 μs | 5,241,536.1 μs | 36.32 | 0.78 | | RemoveUsingMoveStrategy | .NET Core 3.1 | 10000000 | 2000 | Middle | 16,547.3 μs | 72.72 μs | 101.95 μs | 16,520.7 μs | 0.11 | 0.00 | | Answer63496225_TheodorZoulias | .NET Core 3.1 | 10000000 | 2000 | Middle | 137,218.2 μs | 716.45 μs | 980.69 μs | 137,027.0 μs | 0.95 | 0.02 | | Answer63496768_CoolBots | .NET Core 3.1 | 10000000 | 2000 | Middle | 23,728.5 μs | 79.84 μs | 111.93 μs | 23,689.9 μs | 0.16 | 0.00 | | Answer63495657_NetMage | .NET Core 3.1 | 10000000 | 2000 | Middle | 17,298.3 μs | 216.46 μs | 310.44 μs | 17,165.5 μs | 0.12 | 0.00 | | Answer63496256_Vernou | .NET Core 3.1 | 10000000 | 2000 | Middle | 2,999.7 μs | 85.78 μs | 123.03 μs | 2,957.1 μs | 0.02 | 0.00 | [snip]
要查看各种实现,请向下滚动三分之一并查找用[Benchmark()]
装饰的方法。
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using BenchmarkDotNet.Attributes;
namespace SO63495264
{
public class Benchmarks
{
public enum RemovalLocation
{
Random,
Beginning,
Middle,
End
}
[Params(10_000, 10_000_000)]
public int DataListSize
{
get; set;
}
[Params(2_000)]
public int RemovalListSize
{
get; set;
}
//[ParamsAllValues()]
[Params(RemovalLocation.Random, RemovalLocation.Middle)]
public RemovalLocation RemovalListLocation
{
get; set;
}
public List<int> DataList
{
get;
set;
}
public IReadOnlyList<int> RemovalList
{
get;
set;
}
[GlobalSetup()]
public void GlobalSetup()
{
IEnumerable<int> removalIndices;
if (RemovalListLocation == RemovalLocation.Random)
{
// Pass a constant seed so the same indices get generated for every run
Random random = new Random(12345);
removalIndices = Enumerable.Range(0, RemovalListSize)
.Select(i => random.Next(DataListSize));
}
else
{
int removalSegmentOffset = RemovalListLocation switch {
RemovalLocation.Beginning => 0,
RemovalLocation.Middle => (DataListSize - RemovalListSize) / 2,
RemovalLocation.End => DataListSize - RemovalListSize,
_ => throw new Exception($"Unexpected {nameof(RemovalLocation)} enumeration value {RemovalListLocation}.")
};
removalIndices = Enumerable.Range(removalSegmentOffset, RemovalListSize);
}
// For efficiency, create a single List<int> to be reused by all iterations for a given set of parameters
DataList = new List<int>(DataListSize);
RemovalList = removalIndices.ToList().AsReadOnly();
}
[IterationSetup()]
public void IterationSetup()
{
// Each iteration could modify DataList, so repopulate its elements for the
// next iteration. DataList is either new or has been Clear()ed by IterationCleanup().
for (int i = 0; i < DataListSize; i++)
DataList.Add(i);
}
[IterationCleanup()]
public void IterationCleanup()
{
DataList.Clear();
}
[GlobalCleanup()]
public void GlobalCleanup()
{
// Force collection of the List<> for the current set of parameters
int generation = GC.GetGeneration(DataList);
DataList = null;
GC.Collect(generation, GCCollectionMode.Forced, true);
}
[Benchmark(Baseline = true)]
public List<int> RemoveUsingListCopy()
{
HashSet<int> removalSet = RemovalList.ToHashSet();
List<int> newList = new List<int>(DataList.Count - removalSet.Count);
for (int index = 0; index < DataList.Count; index++)
if (!removalSet.Contains(index))
newList.Add(DataList[index]);
return newList;
}
// Based on Stack Overflow question 63495264
// https://stackoverflow.com/q/63495264/150605
[Benchmark()]
public List<int> RemoveUsingRemoveAt()
{
// The collection of indices to remove must be in decreasing order
// with no duplicates; all are assumed to be in-range for DataList
foreach (int index in RemovalList.Distinct().OrderByDescending(i => i))
DataList.RemoveAt(index);
return DataList;
}
// From Stack Overflow answer 63498191
// https://stackoverflow.com/a/63498191/150605
[Benchmark()]
public List<int> RemoveUsingMoveStrategy()
{
// The collection of indices to remove must be in increasing order
// with no duplicates; all are assumed to be in-range for DataList
int[] coreIndices = RemovalList.Distinct().OrderBy(i => i).ToArray();
List<(int startIndex, int count, int distance)> moveOperations
= new List<(int, int, int)>();
int candidateIndex;
for (int i = 0; i < coreIndices.Length; i = candidateIndex)
{
int currentIndex = coreIndices[i];
int elementCount = 1;
// Merge removal of consecutive indices into one operation
while ((candidateIndex = i + elementCount) < coreIndices.Length
&& coreIndices[candidateIndex] == currentIndex + elementCount
)
{
elementCount++;
}
int nextIndex = candidateIndex < coreIndices.Length
? coreIndices[candidateIndex]
: DataList.Count;
int moveCount = nextIndex - currentIndex - elementCount;
// Removing one or more elements from the end of the
// list will result in a no-op, 0-length move; omit it
if (moveCount > 0)
{
moveOperations.Add(
(
startIndex: currentIndex + elementCount,
count: moveCount,
distance: i + elementCount
)
);
}
}
// Perform the element moves
foreach ((int startIndex, int count, int distance) in moveOperations)
{
for (int offset = 0; offset < count; offset++)
{
int sourceIndex = startIndex + offset;
int targetIndex = sourceIndex - distance;
DataList[targetIndex] = DataList[sourceIndex];
}
}
// "Trim" the number of removed elements from the end of the list
DataList.RemoveRange(DataList.Count - coreIndices.Length, coreIndices.Length);
return DataList;
}
// Adapted from Stack Overflow answer 63496225 revision 4
// https://stackoverflow.com/revisions/63496225/4
[Benchmark()]
public List<int> Answer63496225_TheodorZoulias()
{
HashSet<int> indicesSet = new HashSet<int>(RemovalList);
int index = 0;
DataList.RemoveAll(_ => indicesSet.Contains(index++));
return DataList;
}
// Adapted from Stack Overflow answer 63496768 revision 2
// https://stackoverflow.com/revisions/63496768/2
[Benchmark()]
public List<int> Answer63496768_CoolBots()
{
List<int> sortedIndicies = RemovalList.Distinct().OrderBy(i => i).ToList();
int sourceStartIndex = 0;
int destStartIndex = 0;
int spanLength = 0;
int skipCount = 0;
// Copy items up to last index to be skipped
foreach (int skipIndex in sortedIndicies)
{
spanLength = skipIndex - sourceStartIndex;
destStartIndex = sourceStartIndex - skipCount;
for (int i = sourceStartIndex; i < sourceStartIndex + spanLength; i++)
{
DataList[destStartIndex] = DataList[i];
destStartIndex++;
}
sourceStartIndex = skipIndex + 1;
skipCount++;
}
// Copy remaining items (between last index to be skipped and end of list)
spanLength = DataList.Count - sourceStartIndex;
destStartIndex = sourceStartIndex - skipCount;
for (int i = sourceStartIndex; i < sourceStartIndex + spanLength; i++)
{
DataList[destStartIndex] = DataList[i];
destStartIndex++;
}
DataList.RemoveRange(destStartIndex, sortedIndicies.Count);
return DataList;
}
// Adapted from Stack Overflow answer 63495657 revision 6
// https://stackoverflow.com/revisions/63495657/6
[Benchmark()]
public List<int> Answer63495657_NetMage()
{
List<int> removeAtList = RemovalList.Distinct().OrderBy(i => i).ToList();
int srcCount = DataList.Count;
int ralCount = removeAtList.Count;
int removeAtIndice = 1;
int freeIndex = removeAtList[0];
int current = freeIndex + 1;
while (current < srcCount)
{
while (removeAtIndice < ralCount && current == removeAtList[removeAtIndice])
{
++current;
++removeAtIndice;
}
if (current < srcCount)
DataList[freeIndex++] = DataList[current++];
}
DataList.RemoveRange(freeIndex, srcCount - freeIndex);
return DataList;
}
// Adapted from Stack Overflow answer 63496256 revision 3
// https://stackoverflow.com/revisions/63496256/3
[Benchmark()]
public List<int> Answer63496256_Vernou()
{
List<int> indices = RemovalList.Distinct().OrderBy(i => i).ToList();
//Get the internal array
int[] largeArray = (int[]) typeof(List<int>)
.GetField("_items", BindingFlags.Instance | BindingFlags.NonPublic)
.GetValue(DataList);
int current = 0;
int copyFrom = 0;
for (int i = 0; i < indices.Count; i++)
{
int copyTo = indices[i];
if (copyTo < copyFrom)
{
//In case the indice is duplicate,
//The item is already passed
continue;
}
int copyLength = copyTo - copyFrom;
Array.Copy(largeArray, copyFrom, largeArray, current, copyLength);
current += copyLength;
copyFrom = copyTo + 1;
}
//Resize the internal array
DataList.RemoveRange(DataList.Count - indices.Count, indices.Count);
return DataList;
}
}
}
RunBenchmark()
定义要运行的基准测试作业的类型以及运行时。
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Reflection;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;
namespace SO63495264
{
class Program
{
static void Main(string[] args)
{
if (args?.Length == 0)
{
string assemblyFilePath = Assembly.GetExecutingAssembly().Location;
string assemblyFileName = System.IO.Path.GetFileName(assemblyFilePath);
Console.WriteLine($"{assemblyFileName} {{ benchmark | test }}");
}
else if (string.Equals(args[0], "benchmark", StringComparison.OrdinalIgnoreCase))
RunBenchmark();
else if (string.Equals(args[0], "test", StringComparison.OrdinalIgnoreCase))
RunTest();
else
Console.WriteLine($"Unexpected parameter \"{args[0]}\"");
}
static void RunBenchmark()
{
Job baseJob = Job.MediumRun;
IConfig config = DefaultConfig.Instance;
foreach (Runtime runtime in new Runtime[] { CoreRuntime.Core31, ClrRuntime.Net48 })
{
config = config.AddJob(
baseJob.WithRuntime(runtime)
);
}
BenchmarkDotNet.Running.BenchmarkRunner.Run<Benchmarks>(config);
}
static void RunTest()
{
const int ListSize = 20;
const int MaxDisplayElements = 20;
IEnumerable<int> data = Enumerable.Range(0, ListSize);
IReadOnlyList<int> indices = new List<int>(
new int[] {
0, 1, // First two indices
ListSize / 4,
ListSize / 2, ListSize / 2 + 1, // Consecutive indices
ListSize / 4 * 3, ListSize / 4 * 3, // Duplicate indices
ListSize - 2, ListSize - 1 // Last two indices
}
).AsReadOnly();
// Discover and invoke the benchmark methods the same way BenchmarkDotNet would
Benchmarks benchmarks = new Benchmarks() {
RemovalList = indices
};
IEnumerable<MethodInfo> benchmarkMethods = benchmarks.GetType()
.GetMethods(BindingFlags.Instance | BindingFlags.Public)
.Where(
method => method.CustomAttributes.Any(
attribute => attribute.AttributeType == typeof(BenchmarkAttribute)
)
);
// Call a known-good method to get the correct results for comparison
benchmarks.DataList = data.ToList();
List<int> controlList = benchmarks.RemoveUsingListCopy();
foreach (MethodInfo benchmarkMethod in benchmarkMethods)
{
List<int> inputList = data.ToList();
benchmarks.DataList = inputList;
Stopwatch watch = Stopwatch.StartNew();
List<int> outputList = (List<int>) benchmarkMethod.Invoke(benchmarks, Array.Empty<object>());
watch.Stop();
Console.WriteLine($"{benchmarkMethod.Name}():");
Console.WriteLine($"\tElements: {{ {string.Join(", ", outputList.Take(MaxDisplayElements))}{(outputList.Count > MaxDisplayElements ? ", ..." : "") } }}");
Console.WriteLine($"\t Count: {outputList.Count:N0}");
Console.WriteLine($"\tSequence: {(outputList.SequenceEqual(controlList) ? "E" : "*Not* e")}qual to control list");
Console.WriteLine($"\tDuration: {watch.Elapsed.TotalMilliseconds:N2} milliseconds");
Console.WriteLine($"\tReturned: {(object.ReferenceEquals(outputList, inputList) ? "Input list" : "New list")}");
}
}
}
}
为了对 .NET Framework (4.8) 进行基准测试,我必须将以下属性添加到我的 .NET Core .csproj
项目文件中:
<TargetFrameworks>netcoreapp3.1;net48</TargetFrameworks>
<LangVersion>8.0</LangVersion>
剩余41 个字符!
由于源列表的顺序很重要,您可以将每个项目向下移动列表,跳过要删除的索引,然后删除列表的末尾。
更新:获取RemoveAll
的 .Net Core 源代码并将其修改为索引列表而不是谓词。
更新 2:优化为尽可能不重复测试。
更新 3:简化为具有额外代码的优化在基准测试中被证明较慢。
将src
作为大列表,并将removeAtList
作为要以某种随机顺序删除的索引,您可以执行以下操作:
removeAtList.Sort();
var srcCount = src.Count;
var ralCount = removeAtList.Count;
var removeAtIndice = 1;
var freeIndex = removeAtList[0];
var current = freeIndex+1;
while (current < srcCount) {
while (removeAtIndice < ralCount && current == removeAtList[removeAtIndice]) {
++current;
++removeAtIndice;
}
if (current < srcCount)
src[freeIndex++] = src[current++];
}
src.RemoveRange(freeIndex, srcCount-freeIndex);
对于 10 亿个随机整数元素列表和 1000 - 3000 个随机索引元素列表,我使用此算法每次删除得到 1.1 毫秒。 使用RemoveAt
,我每次删除的时间超过 232.77 毫秒,所以这快了大约 200 倍。
允许并行化的一种方法是将列表分成多个片段; 也许最初(任意)分离 100 万个元素的板。 只要每个 slab 维护自己的计数,您就可以按索引将工作拆分为从不同的 slab 中移除(完全基于计数),然后并发执行实际的移除工作。 如果你在每个中都留有一些备用容量,你还可以更便宜地向中间添加元素,因为你通常只接触一个 slab。 随机访问会稍微慢一些,因为您可能需要查看多个 slab 计数以确定正确的 slab,但是如果 slab 计数保持在一个连续的向量中(而不是针对每个 slab),您应该有出色的 memory 缓存命中在做的时候。
这个答案是基于这里的其他答案——主要是,我在列表中向上移动元素,正如@Vernou(在他们的答案中)和@BACON(在评论中)所建议的那样。 这个最终是高性能的(不像我的前几种方法),并且比迄今为止发布的其他解决方案更快,至少在我的测试中 - 我尝试了 OP 的2_000_000_000
条目和2_000
设置 - 我的笔记本电脑上的运行时间不到 10 秒(i7 -8550U @ 1.8GHz,16GB 内存):
static void FilterOutIndicies(List<int> values, List<int> sortedIndicies)
{
int sourceStartIndex = 0;
int destStartIndex = 0;
int spanLength = 0;
int skipCount = 0;
// Copy items up to last index to be skipped
foreach (var skipIndex in sortedIndicies)
{
spanLength = skipIndex - sourceStartIndex;
destStartIndex = sourceStartIndex - skipCount;
for (int i = sourceStartIndex; i < sourceStartIndex + spanLength; i++)
{
values[destStartIndex] = values[i];
destStartIndex++;
}
sourceStartIndex = skipIndex + 1;
skipCount++;
}
// Copy remaining items (between last index to be skipped and end of list)
spanLength = values.Count - sourceStartIndex;
destStartIndex = sourceStartIndex - skipCount;
for (int i = sourceStartIndex; i < sourceStartIndex + spanLength; i++)
{
values[destStartIndex] = values[i];
destStartIndex++;
}
values.RemoveRange(destStartIndex, sortedIndicies.Count);
}
List.RemoveAt
方法从删除的项目中复制所有下一个项目。 在您的情况下,此副本每个项目 2,000 * 2,000,000,000 次(不是真的,但接近真实)。
一种解决方案是在删除的项目和下一个删除的项目之间手动复制项目:
static void Main(string[] args)
{
var largeList = Enumerable.Range(0, 2_000_000_000).ToList();
var indices = new List<int>();
var rand = new Random();
for (var i = 0; i < 20000; i++)
{
indices.Add(rand.Next(0, largeList.Count - 1));
}
indices.Sort();
var watch = new Stopwatch();
watch.Start();
// You can convert the list to array with ToArray,
// but this duplicate the memory use.
// Or get the internal array by reflection,
// but reflection on external library isn't recommended
var largeArray = (int[])typeof(List<int>)
.GetField("_items", BindingFlags.Instance | BindingFlags.NonPublic)
.GetValue(largeList);
var current = 0;
var copyFrom = 0;
for (var i = 0; i < indices.Count; i++)
{
var copyTo = indices[i];
if (copyTo < copyFrom)
{
//In case the indice is duplicate,
//The item is already passed
continue;
}
var copyLength = copyTo - copyFrom;
Array.Copy(largeArray, copyFrom, largeArray, current, copyLength);
current += copyLength;
copyFrom = copyTo + 1;
}
//Resize the internal array
largeList.RemoveRange(largeList.Count - indices.Count, indices.Count);
watch.Stop();
Console.WriteLine(watch.Elapsed);
Console.WriteLine(largeList.Count);
}
当您有多个项目要从List
中删除,并且无法用新List
替换List
时,最有效的方法是使用RemoveAll
方法而不是RemoveAt
。 RemoveAll
仅重新排列List
的内部 state 一次,而不是对每个删除的项目重新排列一次。
RemoveAll
接受一个Predicate<T>
,它将为列表(大列表)中的每个项目调用一次。 不幸的是,这个代表没有收到当前测试项目的索引。 但是,您可能依赖于了解RemoveAll
的实现方式。 源代码显示项目按升序顺序测试。 因此,基于这些知识,您可以使用以下三行非常有效地从列表中删除选定的索引:
var indicesSet = new HashSet<int>(indices);
int index = 0;
largeList.RemoveAll(_ => indicesSet.Contains(index++));
但你真的不应该。 如果 .NET 的未来版本带有RemoveAll
的不同内部实现,则此解决方案将严重崩溃。 因此,将此视为肮脏的 hack,而不是解决问题的生产质量解决方案。
我有一个非常大的整数列表(大约 20 亿个元素)和一个带有索引的列表(几千个元素),我需要从第一个列表中删除元素。 我目前的方法是遍历第二个列表中的所有索引,将每个索引传递给第一个列表的RemoveAt()
方法:
indices.Sort();
indices.Reverse();
for (i = 0; i < indices.Count; i++)
{
largeList.RemoveAt(indices[i]);
}
但是,大约需要 2 分钟才能完成。 我真的需要更快地执行此操作。 有没有办法优化这个?
我有一个 10 核的 Intel i9X CPU,所以也许有某种并行处理方式?
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.