简体   繁体   中英

How can I efficiently remove elements by index from a very large list?

I have a very large list of integers (about 2 billion elements) and a list with indices (couple thousand elements) at which I need to remove elements from the first list. My current approach is to loop over all indices in the second list, passing each to the RemoveAt() method of the first list:

indices.Sort();
indices.Reverse();
for (i = 0; i < indices.Count; i++)
{
    largeList.RemoveAt(indices[i]);
}

However, it takes about 2 minutes to finish. I really need to perform this operation much faster. Is there any way of optimizing this?

I have a Intel i9X CPU with 10 cores, so maybe some way of parallel processing?

Each time RemoveAt() is called it has to move every element that comes after the specified index up by one element. If there are thousands of elements to remove in a very large list, that will result in many, many (needless) moves.

My thinking was that if you can calculate the start index and length of each move, you can process the list in a single pass with no overlapping moves. This is done by comparing adjacent elements in the list of indices to remove. While this does mean building a third List<> of move operations to perform, my hope is having the most efficient, minimal move strategy planned out in advance will pay off in the end (or perhaps there's a way to do that without allocating any objects that presently isn't occurring to me).

You can see my implementation, RemoveUsingMoveStrategy() , in the benchmark code below. Running the launcher program below in test mode, we can see it — as well as the other answers — produces the correct result when given indices 0 , 1 , 5 , 10 , 11 , 15 , 15 (repeated), 18 , and 19 to remove from a 20 -element List<int> ...

PS> dotnet run --configuration release --framework netcoreapp3.1 -- test
[snip]
RemoveUsingMoveStrategy():
        Elements: { 2, 3, 4, 6, 7, 8, 9, 12, 13, 14, 16, 17 }
           Count: 12
        Sequence: Equal to control list
        Duration: 3.95 milliseconds
        Returned: Input list
[snip]

Benchmark notes

  • It was my intention to benchmark against a billion-element List<int> , but RemoveUsingRemoveAt() — inspired by the code in the question — is so inefficient that it was taking too long, so I only went up to 10 million elements.

  • I still hope to run a billion-element benchmark that excludes RemoveUsingRemoveAt() . For that reason, I introduced a not-quite-as-naive implementation called RemoveUsingListCopy() to serve as the baseline for comparison across all list sizes. As the name implies, it doesn't modify the input list but creates a new one with the removals applied.

  • Since not all implementations require the list of indices to remove to be unique and/or sorted, I figured the fairest way to benchmark would be to include that overhead in the benchmark time where the method requires it.

  • The only other changes I made to code from other answers was to make use of the benchmark's lists ( DataList and RemovalList ) and to change from var to explicit types for reader clarity.

  • RemovalListLocation indicates from where in DataList indices were removed.

    • For Beginning , Middle , and End it is a contiguous RemovalListSize -sized block removed from that location.
    • For Random it is RemovalListSize random, valid, unsorted, not-guaranteed-to-be-unique indices generated from a constant seed.

    To keep the results short(er), I opted to only benchmark the Middle — thinking it'd be a good middle -of-the-road compromise — and Random values.

Benchmark results

  • RemoveUsingRemoveAt() is terrible . Don't do that.

  • For the smaller list, RemoveUsingListCopy() is consistently the fastest, at the expense of doubling memory usage.

  • For the larger list, Vernou's answer is as least 5x as fast as the other answers, at the expense of relying on implementation details.

    This just goes to show that you shouldn't necessarily always favor List<> over an array — unless you need its additional capabilities — because it isn't always superior. It insulates the underlying array, preventing you (without reflection) from using more performant access methods like Array.Copy() and unsafe code on it.

  • Theodor Zoulias's answer does well with the smaller list, but I think having to "touch" all 10 million indices — each with a call to HashSet<>.Contains() and increment of an index variable — really hurt it with the larger list.

  • The remaining implementations (including mine) have roughly the same performance: not the best, but still pretty good.

Benchmark data

Running the launcher program defined later in this answer in benchmark mode, I get these results from BenchmarkDotNet ...

PS> dotnet run --configuration release --framework netcoreapp3.1 -- benchmark
[snip]
// * Summary *

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.450 (2004/?/20H1)
Intel Core i7 CPU 860 2.80GHz (Nehalem), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.401
  [Host]    : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
  MediumRun : .NET Framework 4.8 (4.8.4200.0), X64 RyuJIT

Job=MediumRun  InvocationCount=1  IterationCount=15  
LaunchCount=2  UnrollFactor=1  WarmupCount=10  

|                        Method |       Runtime | DataListSize | RemovalListSize | RemovalListLocation |           Mean |         Error |        StdDev |         Median | Ratio | RatioSD |
|------------------------------ |-------------- |------------- |---------------- |-------------------- |---------------:|--------------:|--------------:|---------------:|------:|--------:|
|           RemoveUsingListCopy |      .NET 4.8 |        10000 |            2000 |              Random |       379.9 μs |      15.93 μs |      23.36 μs |       380.4 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt |      .NET 4.8 |        10000 |            2000 |              Random |     1,463.7 μs |      93.79 μs |     137.47 μs |     1,434.7 μs |  3.87 |    0.41 |
|       RemoveUsingMoveStrategy |      .NET 4.8 |        10000 |            2000 |              Random |       635.9 μs |      19.77 μs |      29.60 μs |       624.0 μs |  1.67 |    0.14 |
| Answer63496225_TheodorZoulias |      .NET 4.8 |        10000 |            2000 |              Random |       372.1 μs |      11.52 μs |      16.15 μs |       373.8 μs |  0.99 |    0.07 |
|       Answer63496768_CoolBots |      .NET 4.8 |        10000 |            2000 |              Random |       594.5 μs |      25.13 μs |      36.03 μs |       593.1 μs |  1.57 |    0.12 |
|        Answer63495657_NetMage |      .NET 4.8 |        10000 |            2000 |              Random |       618.8 μs |      17.53 μs |      23.99 μs |       622.4 μs |  1.65 |    0.13 |
|         Answer63496256_Vernou |      .NET 4.8 |        10000 |            2000 |              Random |       645.6 μs |      27.28 μs |      39.99 μs |       632.2 μs |  1.71 |    0.16 |
|                               |               |              |                 |                     |                |               |               |                |       |         |
|           RemoveUsingListCopy | .NET Core 3.1 |        10000 |            2000 |              Random |       391.5 μs |      10.39 μs |      15.55 μs |       391.6 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt | .NET Core 3.1 |        10000 |            2000 |              Random |     1,402.2 μs |      44.20 μs |      64.80 μs |     1,407.6 μs |  3.59 |    0.21 |
|       RemoveUsingMoveStrategy | .NET Core 3.1 |        10000 |            2000 |              Random |       557.9 μs |      19.73 μs |      27.00 μs |       557.2 μs |  1.43 |    0.10 |
| Answer63496225_TheodorZoulias | .NET Core 3.1 |        10000 |            2000 |              Random |       424.3 μs |      20.90 μs |      29.30 μs |       424.2 μs |  1.09 |    0.09 |
|       Answer63496768_CoolBots | .NET Core 3.1 |        10000 |            2000 |              Random |       535.0 μs |      19.37 μs |      27.16 μs |       537.1 μs |  1.37 |    0.08 |
|        Answer63495657_NetMage | .NET Core 3.1 |        10000 |            2000 |              Random |       557.7 μs |      18.73 μs |      25.63 μs |       550.0 μs |  1.43 |    0.09 |
|         Answer63496256_Vernou | .NET Core 3.1 |        10000 |            2000 |              Random |       554.2 μs |      13.82 μs |      18.45 μs |       554.0 μs |  1.42 |    0.07 |
|                               |               |              |                 |                     |                |               |               |                |       |         |
|           RemoveUsingListCopy |      .NET 4.8 |        10000 |            2000 |              Middle |       221.6 μs |       7.25 μs |      10.63 μs |       222.5 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt |      .NET 4.8 |        10000 |            2000 |              Middle |     1,195.3 μs |      20.01 μs |      28.69 μs |     1,187.7 μs |  5.42 |    0.30 |
|       RemoveUsingMoveStrategy |      .NET 4.8 |        10000 |            2000 |              Middle |       405.0 μs |      13.65 μs |      19.14 μs |       410.7 μs |  1.83 |    0.10 |
| Answer63496225_TheodorZoulias |      .NET 4.8 |        10000 |            2000 |              Middle |       206.3 μs |       8.62 μs |      12.09 μs |       204.9 μs |  0.94 |    0.08 |
|       Answer63496768_CoolBots |      .NET 4.8 |        10000 |            2000 |              Middle |       427.5 μs |      15.56 μs |      22.81 μs |       435.4 μs |  1.93 |    0.13 |
|        Answer63495657_NetMage |      .NET 4.8 |        10000 |            2000 |              Middle |       405.4 μs |      13.80 μs |      19.35 μs |       403.8 μs |  1.84 |    0.11 |
|         Answer63496256_Vernou |      .NET 4.8 |        10000 |            2000 |              Middle |       413.9 μs |      15.26 μs |      20.89 μs |       419.8 μs |  1.87 |    0.12 |
|                               |               |              |                 |                     |                |               |               |                |       |         |
|           RemoveUsingListCopy | .NET Core 3.1 |        10000 |            2000 |              Middle |       235.2 μs |      10.73 μs |      15.73 μs |       236.2 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt | .NET Core 3.1 |        10000 |            2000 |              Middle |     1,345.6 μs |      32.07 μs |      43.90 μs |     1,352.7 μs |  5.77 |    0.41 |
|       RemoveUsingMoveStrategy | .NET Core 3.1 |        10000 |            2000 |              Middle |       324.0 μs |       4.92 μs |       7.05 μs |       326.6 μs |  1.39 |    0.09 |
| Answer63496225_TheodorZoulias | .NET Core 3.1 |        10000 |            2000 |              Middle |       262.9 μs |       6.18 μs |       9.06 μs |       265.4 μs |  1.12 |    0.08 |
|       Answer63496768_CoolBots | .NET Core 3.1 |        10000 |            2000 |              Middle |       333.6 μs |      10.14 μs |      13.87 μs |       340.9 μs |  1.43 |    0.11 |
|        Answer63495657_NetMage | .NET Core 3.1 |        10000 |            2000 |              Middle |       313.5 μs |       9.05 μs |      12.69 μs |       310.5 μs |  1.34 |    0.11 |
|         Answer63496256_Vernou | .NET Core 3.1 |        10000 |            2000 |              Middle |       332.3 μs |       6.70 μs |       8.95 μs |       331.9 μs |  1.43 |    0.09 |
|                               |               |              |                 |                     |                |               |               |                |       |         |
|           RemoveUsingListCopy |      .NET 4.8 |     10000000 |            2000 |              Random |   253,977.1 μs |   2,721.70 μs |   3,989.43 μs |   253,809.0 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt |      .NET 4.8 |     10000000 |            2000 |              Random | 5,191,083.4 μs |  13,200.66 μs |  18,931.99 μs | 5,187,162.3 μs | 20.43 |    0.34 |
|       RemoveUsingMoveStrategy |      .NET 4.8 |     10000000 |            2000 |              Random |    65,365.4 μs |     422.41 μs |     592.16 μs |    65,307.3 μs |  0.26 |    0.00 |
| Answer63496225_TheodorZoulias |      .NET 4.8 |     10000000 |            2000 |              Random |   240,584.4 μs |   3,687.89 μs |   5,048.03 μs |   244,336.1 μs |  0.95 |    0.02 |
|       Answer63496768_CoolBots |      .NET 4.8 |     10000000 |            2000 |              Random |    54,168.4 μs |   1,001.37 μs |   1,436.14 μs |    53,390.3 μs |  0.21 |    0.01 |
|        Answer63495657_NetMage |      .NET 4.8 |     10000000 |            2000 |              Random |    72,501.4 μs |     452.46 μs |     634.29 μs |    72,161.2 μs |  0.29 |    0.00 |
|         Answer63496256_Vernou |      .NET 4.8 |     10000000 |            2000 |              Random |     5,814.0 μs |      89.71 μs |     128.67 μs |     5,825.3 μs |  0.02 |    0.00 |
|                               |               |              |                 |                     |                |               |               |                |       |         |
|           RemoveUsingListCopy | .NET Core 3.1 |     10000000 |            2000 |              Random |   239,784.0 μs |   2,721.35 μs |   3,902.88 μs |   241,125.5 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt | .NET Core 3.1 |     10000000 |            2000 |              Random | 5,538,337.5 μs | 353,505.30 μs | 495,565.06 μs | 5,208,226.1 μs | 23.12 |    2.15 |
|       RemoveUsingMoveStrategy | .NET Core 3.1 |     10000000 |            2000 |              Random |    33,071.8 μs |     103.80 μs |     138.57 μs |    33,030.5 μs |  0.14 |    0.00 |
| Answer63496225_TheodorZoulias | .NET Core 3.1 |     10000000 |            2000 |              Random |   240,825.5 μs |     851.49 μs |   1,248.11 μs |   240,520.9 μs |  1.00 |    0.02 |
|       Answer63496768_CoolBots | .NET Core 3.1 |     10000000 |            2000 |              Random |    26,265.0 μs |      90.76 μs |     124.23 μs |    26,253.0 μs |  0.11 |    0.00 |
|        Answer63495657_NetMage | .NET Core 3.1 |     10000000 |            2000 |              Random |    48,670.6 μs |     581.51 μs |     833.99 μs |    48,303.0 μs |  0.20 |    0.00 |
|         Answer63496256_Vernou | .NET Core 3.1 |     10000000 |            2000 |              Random |     5,905.5 μs |      96.27 μs |     131.78 μs |     5,915.1 μs |  0.02 |    0.00 |
|                               |               |              |                 |                     |                |               |               |                |       |         |
|           RemoveUsingListCopy |      .NET 4.8 |     10000000 |            2000 |              Middle |   153,776.2 μs |   2,454.90 μs |   3,674.38 μs |   152,872.0 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt |      .NET 4.8 |     10000000 |            2000 |              Middle | 5,245,952.0 μs |  13,845.58 μs |  20,294.67 μs | 5,252,922.4 μs | 34.10 |    0.81 |
|       RemoveUsingMoveStrategy |      .NET 4.8 |     10000000 |            2000 |              Middle |    33,233.6 μs |     110.33 μs |     158.24 μs |    33,217.3 μs |  0.22 |    0.01 |
| Answer63496225_TheodorZoulias |      .NET 4.8 |     10000000 |            2000 |              Middle |   128,949.8 μs |     560.72 μs |     804.17 μs |   128,724.9 μs |  0.84 |    0.02 |
|       Answer63496768_CoolBots |      .NET 4.8 |     10000000 |            2000 |              Middle |    48,965.1 μs |      70.75 μs |      94.45 μs |    48,957.3 μs |  0.32 |    0.01 |
|        Answer63495657_NetMage |      .NET 4.8 |     10000000 |            2000 |              Middle |    32,641.5 μs |      66.85 μs |      91.51 μs |    32,610.0 μs |  0.21 |    0.01 |
|         Answer63496256_Vernou |      .NET 4.8 |     10000000 |            2000 |              Middle |     2,982.2 μs |      29.47 μs |      41.31 μs |     2,961.9 μs |  0.02 |    0.00 |
|                               |               |              |                 |                     |                |               |               |                |       |         |
|           RemoveUsingListCopy | .NET Core 3.1 |     10000000 |            2000 |              Middle |   144,208.7 μs |   2,035.88 μs |   2,984.16 μs |   142,693.2 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt | .NET Core 3.1 |     10000000 |            2000 |              Middle | 5,235,957.7 μs |  13,674.19 μs |  20,043.46 μs | 5,241,536.1 μs | 36.32 |    0.78 |
|       RemoveUsingMoveStrategy | .NET Core 3.1 |     10000000 |            2000 |              Middle |    16,547.3 μs |      72.72 μs |     101.95 μs |    16,520.7 μs |  0.11 |    0.00 |
| Answer63496225_TheodorZoulias | .NET Core 3.1 |     10000000 |            2000 |              Middle |   137,218.2 μs |     716.45 μs |     980.69 μs |   137,027.0 μs |  0.95 |    0.02 |
|       Answer63496768_CoolBots | .NET Core 3.1 |     10000000 |            2000 |              Middle |    23,728.5 μs |      79.84 μs |     111.93 μs |    23,689.9 μs |  0.16 |    0.00 |
|        Answer63495657_NetMage | .NET Core 3.1 |     10000000 |            2000 |              Middle |    17,298.3 μs |     216.46 μs |     310.44 μs |    17,165.5 μs |  0.12 |    0.00 |
|         Answer63496256_Vernou | .NET Core 3.1 |     10000000 |            2000 |              Middle |     2,999.7 μs |      85.78 μs |     123.03 μs |     2,957.1 μs |  0.02 |    0.00 |
[snip]

Benchmark code

To see the various implementations, scroll a third of the way down and look for the methods decorated with [Benchmark()] .

This requires the BenchmarkDotNet package .

using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using BenchmarkDotNet.Attributes;

namespace SO63495264
{
    public class Benchmarks
    {
        public enum RemovalLocation
        {
            Random,
            Beginning,
            Middle,
            End
        }

        [Params(10_000, 10_000_000)]
        public int DataListSize
        {
            get; set;
        }

        [Params(2_000)]
        public int RemovalListSize
        {
            get; set;
        }

        //[ParamsAllValues()]
        [Params(RemovalLocation.Random, RemovalLocation.Middle)]
        public RemovalLocation RemovalListLocation
        {
            get; set;
        }

        public List<int> DataList
        {
            get;
            set;
        }

        public IReadOnlyList<int> RemovalList
        {
            get;
            set;
        }

        [GlobalSetup()]
        public void GlobalSetup()
        {
            IEnumerable<int> removalIndices;

            if (RemovalListLocation == RemovalLocation.Random)
            {
                // Pass a constant seed so the same indices get generated for every run
                Random random = new Random(12345);

                removalIndices = Enumerable.Range(0, RemovalListSize)
                    .Select(i => random.Next(DataListSize));
            }
            else
            {
                int removalSegmentOffset = RemovalListLocation switch {
                    RemovalLocation.Beginning => 0,
                    RemovalLocation.Middle => (DataListSize - RemovalListSize) / 2,
                    RemovalLocation.End => DataListSize - RemovalListSize,
                    _ => throw new Exception($"Unexpected {nameof(RemovalLocation)} enumeration value {RemovalListLocation}.")
                };

                removalIndices = Enumerable.Range(removalSegmentOffset, RemovalListSize);
            }

            // For efficiency, create a single List<int> to be reused by all iterations for a given set of parameters
            DataList = new List<int>(DataListSize);
            RemovalList = removalIndices.ToList().AsReadOnly();
        }

        [IterationSetup()]
        public void IterationSetup()
        {
            // Each iteration could modify DataList, so repopulate its elements for the
            // next iteration.  DataList is either new or has been Clear()ed by IterationCleanup().
            for (int i = 0; i < DataListSize; i++)
                DataList.Add(i);
        }

        [IterationCleanup()]
        public void IterationCleanup()
        {
            DataList.Clear();
        }

        [GlobalCleanup()]
        public void GlobalCleanup()
        {
            // Force collection of the List<> for the current set of parameters
            int generation = GC.GetGeneration(DataList);

            DataList = null;
            GC.Collect(generation, GCCollectionMode.Forced, true);
        }

        [Benchmark(Baseline = true)]
        public List<int> RemoveUsingListCopy()
        {
            HashSet<int> removalSet = RemovalList.ToHashSet();
            List<int> newList = new List<int>(DataList.Count - removalSet.Count);

            for (int index = 0; index < DataList.Count; index++)
                if (!removalSet.Contains(index))
                    newList.Add(DataList[index]);

            return newList;
        }

        // Based on Stack Overflow question 63495264
        // https://stackoverflow.com/q/63495264/150605
        [Benchmark()]
        public List<int> RemoveUsingRemoveAt()
        {
            // The collection of indices to remove must be in decreasing order
            // with no duplicates; all are assumed to be in-range for DataList
            foreach (int index in RemovalList.Distinct().OrderByDescending(i => i))
                DataList.RemoveAt(index);

            return DataList;
        }

        // From Stack Overflow answer 63498191
        // https://stackoverflow.com/a/63498191/150605
        [Benchmark()]
        public List<int> RemoveUsingMoveStrategy()
        {
            // The collection of indices to remove must be in increasing order
            // with no duplicates; all are assumed to be in-range for DataList
            int[] coreIndices = RemovalList.Distinct().OrderBy(i => i).ToArray();
            List<(int startIndex, int count, int distance)> moveOperations
                = new List<(int, int, int)>();

            int candidateIndex;
            for (int i = 0; i < coreIndices.Length; i = candidateIndex)
            {
                int currentIndex = coreIndices[i];
                int elementCount = 1;

                // Merge removal of consecutive indices into one operation
                while ((candidateIndex = i + elementCount) < coreIndices.Length
                    && coreIndices[candidateIndex] == currentIndex + elementCount
                )
                {
                    elementCount++;
                }

                int nextIndex = candidateIndex < coreIndices.Length
                    ? coreIndices[candidateIndex]
                    : DataList.Count;
                int moveCount = nextIndex - currentIndex - elementCount;

                // Removing one or more elements from the end of the
                // list will result in a no-op, 0-length move; omit it
                if (moveCount > 0)
                {
                    moveOperations.Add(
                        (
                            startIndex: currentIndex + elementCount,
                            count: moveCount,
                            distance: i + elementCount
                        )
                    );
                }
            }

            // Perform the element moves
            foreach ((int startIndex, int count, int distance) in moveOperations)
            {
                for (int offset = 0; offset < count; offset++)
                {
                    int sourceIndex = startIndex + offset;
                    int targetIndex = sourceIndex - distance;

                    DataList[targetIndex] = DataList[sourceIndex];
                }
            }

            // "Trim" the number of removed elements from the end of the list
            DataList.RemoveRange(DataList.Count - coreIndices.Length, coreIndices.Length);

            return DataList;
        }

        // Adapted from Stack Overflow answer 63496225 revision 4
        // https://stackoverflow.com/revisions/63496225/4
        [Benchmark()]
        public List<int> Answer63496225_TheodorZoulias()
        {
            HashSet<int> indicesSet = new HashSet<int>(RemovalList);
            int index = 0;

            DataList.RemoveAll(_ => indicesSet.Contains(index++));

            return DataList;
        }


        // Adapted from Stack Overflow answer 63496768 revision 2
        // https://stackoverflow.com/revisions/63496768/2
        [Benchmark()]
        public List<int> Answer63496768_CoolBots()
        {
            List<int> sortedIndicies = RemovalList.Distinct().OrderBy(i => i).ToList();

            int sourceStartIndex = 0;
            int destStartIndex = 0;
            int spanLength = 0;

            int skipCount = 0;

            // Copy items up to last index to be skipped
            foreach (int skipIndex in sortedIndicies)
            {
                spanLength = skipIndex - sourceStartIndex;
                destStartIndex = sourceStartIndex - skipCount;

                for (int i = sourceStartIndex; i < sourceStartIndex + spanLength; i++)
                {
                    DataList[destStartIndex] = DataList[i];
                    destStartIndex++;
                }

                sourceStartIndex = skipIndex + 1;
                skipCount++;
            }

            // Copy remaining items (between last index to be skipped and end of list)
            spanLength = DataList.Count - sourceStartIndex;
            destStartIndex = sourceStartIndex - skipCount;

            for (int i = sourceStartIndex; i < sourceStartIndex + spanLength; i++)
            {
                DataList[destStartIndex] = DataList[i];
                destStartIndex++;
            }

            DataList.RemoveRange(destStartIndex, sortedIndicies.Count);

            return DataList;
        }

        // Adapted from Stack Overflow answer 63495657 revision 6
        // https://stackoverflow.com/revisions/63495657/6
        [Benchmark()]
        public List<int> Answer63495657_NetMage()
        {
            List<int> removeAtList = RemovalList.Distinct().OrderBy(i => i).ToList();

            int srcCount = DataList.Count;
            int ralCount = removeAtList.Count;
            int removeAtIndice = 1;
            int freeIndex = removeAtList[0];
            int current = freeIndex + 1;
            while (current < srcCount)
            {
                while (removeAtIndice < ralCount && current == removeAtList[removeAtIndice])
                {
                    ++current;
                    ++removeAtIndice;
                }

                if (current < srcCount)
                    DataList[freeIndex++] = DataList[current++];
            }

            DataList.RemoveRange(freeIndex, srcCount - freeIndex);

            return DataList;
        }

        // Adapted from Stack Overflow answer 63496256 revision 3
        // https://stackoverflow.com/revisions/63496256/3
        [Benchmark()]
        public List<int> Answer63496256_Vernou()
        {
            List<int> indices = RemovalList.Distinct().OrderBy(i => i).ToList();

            //Get the internal array
            int[] largeArray = (int[]) typeof(List<int>)
                .GetField("_items", BindingFlags.Instance | BindingFlags.NonPublic)
                .GetValue(DataList);

            int current = 0;
            int copyFrom = 0;

            for (int i = 0; i < indices.Count; i++)
            {
                int copyTo = indices[i];
                if (copyTo < copyFrom)
                {
                    //In case the indice is duplicate,
                    //The item is already passed
                    continue;
                }
                int copyLength = copyTo - copyFrom;
                Array.Copy(largeArray, copyFrom, largeArray, current, copyLength);
                current += copyLength;
                copyFrom = copyTo + 1;
            }
            //Resize the internal array
            DataList.RemoveRange(DataList.Count - indices.Count, indices.Count);

            return DataList;
        }
    }
}

Launcher code

RunBenchmark() defines the type of benchmark job to run and for which runtime(s).

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Reflection;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;

namespace SO63495264
{
    class Program
    {
        static void Main(string[] args)
        {
            if (args?.Length == 0)
            {
                string assemblyFilePath = Assembly.GetExecutingAssembly().Location;
                string assemblyFileName = System.IO.Path.GetFileName(assemblyFilePath);

                Console.WriteLine($"{assemblyFileName} {{ benchmark | test }}");
            }
            else if (string.Equals(args[0], "benchmark", StringComparison.OrdinalIgnoreCase))
                RunBenchmark();
            else if (string.Equals(args[0], "test", StringComparison.OrdinalIgnoreCase))
                RunTest();
            else
                Console.WriteLine($"Unexpected parameter \"{args[0]}\"");
        }

        static void RunBenchmark()
        {
            Job baseJob = Job.MediumRun;
            IConfig config = DefaultConfig.Instance;

            foreach (Runtime runtime in new Runtime[] { CoreRuntime.Core31, ClrRuntime.Net48 })
            {
                config = config.AddJob(
                    baseJob.WithRuntime(runtime)
                );
            }

            BenchmarkDotNet.Running.BenchmarkRunner.Run<Benchmarks>(config);
        }

        static void RunTest()
        {
            const int ListSize = 20;
            const int MaxDisplayElements = 20;

            IEnumerable<int> data = Enumerable.Range(0, ListSize);
            IReadOnlyList<int> indices = new List<int>(
                new int[] {
                    0,                               1, // First two indices
                    ListSize / 4,
                    ListSize / 2,     ListSize / 2 + 1, // Consecutive indices
                    ListSize / 4 * 3, ListSize / 4 * 3, // Duplicate indices
                    ListSize - 2,     ListSize - 1      // Last two indices
                }
            ).AsReadOnly();

            // Discover and invoke the benchmark methods the same way BenchmarkDotNet would
            Benchmarks benchmarks = new Benchmarks() {
                RemovalList = indices
            };
            IEnumerable<MethodInfo> benchmarkMethods = benchmarks.GetType()
                .GetMethods(BindingFlags.Instance | BindingFlags.Public)
                .Where(
                    method => method.CustomAttributes.Any(
                        attribute => attribute.AttributeType == typeof(BenchmarkAttribute)
                    )
                );

            // Call a known-good method to get the correct results for comparison
            benchmarks.DataList = data.ToList();
            List<int> controlList = benchmarks.RemoveUsingListCopy();

            foreach (MethodInfo benchmarkMethod in benchmarkMethods)
            {
                List<int> inputList = data.ToList();

                benchmarks.DataList = inputList;
                Stopwatch watch = Stopwatch.StartNew();
                List<int> outputList = (List<int>) benchmarkMethod.Invoke(benchmarks, Array.Empty<object>());
                watch.Stop();

                Console.WriteLine($"{benchmarkMethod.Name}():");
                Console.WriteLine($"\tElements: {{ {string.Join(", ", outputList.Take(MaxDisplayElements))}{(outputList.Count > MaxDisplayElements ? ", ..." : "") } }}");
                Console.WriteLine($"\t   Count: {outputList.Count:N0}");
                Console.WriteLine($"\tSequence: {(outputList.SequenceEqual(controlList) ? "E" : "*Not* e")}qual to control list");
                Console.WriteLine($"\tDuration: {watch.Elapsed.TotalMilliseconds:N2} milliseconds");
                Console.WriteLine($"\tReturned: {(object.ReferenceEquals(outputList, inputList) ? "Input list" : "New list")}");
            }
        }
    }
}

In order to benchmark on .NET Framework (4.8) I had to add the following properties to my .NET Core .csproj project file:

<TargetFrameworks>netcoreapp3.1;net48</TargetFrameworks>
<LangVersion>8.0</LangVersion>

41 characters to spare!

Since the order of the source list is significant, you can move each item down the list, skipping indices to be removed, and then remove the end of the list.

UPDATE: Took the.Net Core source code for RemoveAll and modified it for a indice list instead of a predicate.

UPDATE 2: Optimized to not repeat tests if possible.

UPDATE 3: Simplified as optimization having extra code proved slower in benchmarks.

Having src as the large list, and removeAtList as the indices to remove in some random order, you can do:

removeAtList.Sort();

var srcCount = src.Count;
var ralCount = removeAtList.Count;
var removeAtIndice = 1;
var freeIndex = removeAtList[0];
var current = freeIndex+1;
while (current < srcCount) {
    while (removeAtIndice < ralCount && current == removeAtList[removeAtIndice]) {
        ++current;
        ++removeAtIndice;
    }

    if (current < srcCount)
        src[freeIndex++] = src[current++];
}

src.RemoveRange(freeIndex, srcCount-freeIndex);

For a one billion element list of random integers, and a 1000 - 3000 element list of random indices, I get 1.1 ms per remove with this algorithm. Using RemoveAt , I get over 232.77 ms per remove, so this is about 200 times faster.

One way to allow this to be parallelized would be to break the list into multiple fragments; perhaps initially (arbitrarily) separate slabs of 1 million elements. As long as each slab maintains its own count, you can split the work by index into removals from different slabs (based purely on the counts), and then do the actual removal work concurrently. If you leave some spare capacity in each, you can also add elements into the middle more cheaply, as you are usually only touching one slab. Random access will be a little slower, as you may need to look at multiple slab counts to determine the correct slab, but if the slab counts are maintained in a contiguous vector (rather than against each slab), you should have excellent memory cache hit while doing it.

This answer is based on other answers here - mainly, I am shifting elements up within the list, as suggested by @Vernou (in their answer) and @BACON (in comments). This one is finally performant (unlike my first few approaches), and is faster than other solutions posted so far, at least in my tests - I tried OP's setup of 2_000_000_000 entries and 2_000 indicies - runtime is under 10 seconds on my laptop (i7-8550U @ 1.8GHz, 16GB RAM):

static void FilterOutIndicies(List<int> values, List<int> sortedIndicies)
{
    int sourceStartIndex = 0;
    int destStartIndex = 0;
    int spanLength = 0;

    int skipCount = 0;

    // Copy items up to last index to be skipped
    foreach (var skipIndex in sortedIndicies)
    {
        spanLength = skipIndex - sourceStartIndex;
        destStartIndex = sourceStartIndex - skipCount;

        for (int i = sourceStartIndex; i < sourceStartIndex + spanLength; i++)
        {
            values[destStartIndex] = values[i];
            destStartIndex++;
        }

        sourceStartIndex = skipIndex + 1;
        skipCount++;
    }

    // Copy remaining items (between last index to be skipped and end of list)
    spanLength = values.Count - sourceStartIndex;
    destStartIndex = sourceStartIndex - skipCount;

    for (int i = sourceStartIndex; i < sourceStartIndex + spanLength; i++)
    {
        values[destStartIndex] = values[i];
        destStartIndex++;
    }

    values.RemoveRange(destStartIndex, sortedIndicies.Count);
}

The method List.RemoveAt copy all next items from the removed item. In your case, this copy 2,000 * 2,000,000,000 times each items (not really, but the true is near).

A solution is manually copy item between the removed item and the next removed item:

static void Main(string[] args)
{
    var largeList = Enumerable.Range(0, 2_000_000_000).ToList();

    var indices = new List<int>();
    var rand = new Random();
    for (var i = 0; i < 20000; i++)
    {
        indices.Add(rand.Next(0, largeList.Count - 1));
    }
    indices.Sort();

    var watch = new Stopwatch();
    watch.Start();

    // You can convert the list to array with ToArray,
    // but this duplicate the memory use.
    // Or get the internal array by reflection,
    // but reflection on external library isn't recommended
    var largeArray = (int[])typeof(List<int>)
        .GetField("_items", BindingFlags.Instance | BindingFlags.NonPublic)
        .GetValue(largeList);

    var current = 0;
    var copyFrom = 0;

    for (var i = 0; i < indices.Count; i++)
    {
        var copyTo = indices[i];
        if (copyTo < copyFrom)
        {
            //In case the indice is duplicate,
            //The item is already passed
            continue;
        }
        var copyLength = copyTo - copyFrom;
        Array.Copy(largeArray, copyFrom, largeArray, current, copyLength);
        current += copyLength;
        copyFrom = copyTo + 1;
    }
    //Resize the internal array
    largeList.RemoveRange(largeList.Count - indices.Count, indices.Count);

    watch.Stop();
    Console.WriteLine(watch.Elapsed);
    Console.WriteLine(largeList.Count);
}

When you have multiple items to remove from a List , and replacing the List with a new List is not an option, the most efficient way is to use the RemoveAll method instead of the RemoveAt . The RemoveAll rearranges the internal state of the List only once, instead of doing it once per each item removed.

The RemoveAll accepts aPredicate<T> that will be invoked once for each item in the list (the large list). Unfortunately this delegate doesn't receive the index of the currently tested item. You could however depend on knowing how the RemoveAll is implemented. The source code reveals that the items are tested sequentially in ascending order. So based on this knowledge you could remove the selected indices from the list, very efficiently, with this three-liner:

var indicesSet = new HashSet<int>(indices);
int index = 0;
largeList.RemoveAll(_ => indicesSet.Contains(index++));

But you really shouldn't . This solution will break horribly if a future version of .NET comes with a different internal implementation of the RemoveAll . So consider this to be a dirty hack, and not a production-quality solution to the problem.

I have a very large list of integers (about 2 billion elements) and a list with indices (couple thousand elements) at which I need to remove elements from the first list. My current approach is to loop over all indices in the second list, passing each to the RemoveAt() method of the first list:

indices.Sort();
indices.Reverse();
for (i = 0; i < indices.Count; i++)
{
    largeList.RemoveAt(indices[i]);
}

However, it takes about 2 minutes to finish. I really need to perform this operation much faster. Is there any way of optimizing this?

I have a Intel i9X CPU with 10 cores, so maybe some way of parallel processing?

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM