Deleting from array, mirrored (strange) behavior

Question

The title may seem a little odd, because I have no idea how to describe this in one sentence.

For the Algorithms course we have to micro-optimize some things, one of which is figuring out how deleting from an array works. The assignment is to delete an item from an array and re-align the contents so that there are no gaps. I think it is quite similar to how std::vector::erase works in C++.

Because I like understanding everything at a low level, I went a little further and tried to benchmark my solutions. This produced some weird results.

First, here is the code I used:

class Test {

    Stopwatch sw;
    Obj[] objs;

    public Test() {
        this.sw = new Stopwatch();
        this.objs = new Obj[1000000];

        // Fill objs
        for (int i = 0; i < objs.Length; i++) {
            objs[i] = new Obj(i);
        }
    }

    public void test() {

        // Time deletion
        sw.Restart();
        deleteValue(400000, objs);
        sw.Stop();

        // Show timings
        Console.WriteLine(sw.Elapsed);
    }

    // Delete function
    // value is the to-search-for item in the list of objects
    private static void deleteValue(int value, Obj[] list) {

        for (int i = 0; i < list.Length; i++) {

            if (list[i].Value == value) {
                for (int j = i; j < list.Length - 1; j++) {
                    list[j] = list[j + 1];

                    //if (list[j + 1] == null) {
                    //    break;
                    //}
                }
                list[list.Length - 1] = null;
                break;
            }
        }
    }
}

I would just create an instance of this class and call the test() method. I did this in a loop 25 times.

My findings:

The first round takes a lot longer than the other 24. I think this is because of caching, but I am not sure.
When I search for a value near the start of the list, more items have to be moved in memory than when the value is near the end, yet it still seems to take less time.
Benchmark times vary quite a bit from run to run.
When I enable the commented-out if, performance improves by 10-20%, even if the value I search for is almost at the end of the list (which means the if is evaluated many times without ever being useful).

I have no idea why these things happen. Can someone explain (some of) them? And if someone who is a pro at this sees it, where can I find more information on doing this as efficiently as possible?

Edit after testing:

I did some testing and found some interesting results. I ran the test on an array with a million slots, filled with a million objects. I ran that 25 times and recorded the cumulative time in milliseconds. I did that 10 times and took the average as the final value.

When I run the test with my function described above, I get a score of 362.1 ms.

When I run it with dbc's answer, I get a score of 846.4 ms.

So mine was faster, but then I started to experiment with a half-empty array and things got weird. To get rid of the inevitable NullReferenceExceptions, I added an extra check to the if (thinking it would cost a bit more performance), like so:

if (fromItem != null && fromItem.Value != value)
    list[to++] = fromItem;

This not only worked, but improved performance dramatically! Now I get a score of 247.9 ms.

The weird thing is that the scores seem too low to be true, but they sometimes spike. This is the set I took the average from: 94, 26, 966, 36, 632, 95, 47, 35, 109, 439

So the extra evaluation seems to improve performance, despite being an extra check. How is this possible?

Answer 1

You are using Stopwatch to time your method. This measures the total wall-clock time elapsed during your method call, which can include the time required for .NET to initially JIT-compile your method, interruptions for garbage collection, or slowdowns caused by system load from other processes. Noise from these sources will likely dominate noise due to cache misses.

This answer gives some suggestions as to how you can minimize some of the noise from garbage collection or other processes. To eliminate JIT noise, you should call your method once without timing it -- or show the time taken by the first call in a separate column in your results table since it will be so different. You might also consider using a proper profiler which will report exactly how much time your code used exclusive of "noise" from other threads or processes.
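The warm-up advice above could be sketched roughly as follows. This is only an illustration, not a replacement for a profiler: the `BenchSketch` class and its `Measure` helper are hypothetical names, and forcing a collection before each run is one common way to reduce GC noise, assumed here rather than taken from the linked answer. Because deleting mutates the array, the sketch rebuilds the input outside the timed region for every run.

```csharp
using System;
using System.Diagnostics;

static class BenchSketch
{
    // Hypothetical helper: calls `action` once untimed so the JIT cost is
    // paid up front, then times `runs` further calls, each on fresh input.
    public static TimeSpan[] Measure<T>(Func<T> setup, Action<T> action, int runs)
    {
        // Warm-up call; its time is deliberately not recorded.
        action(setup());

        var results = new TimeSpan[runs];
        var sw = new Stopwatch();
        for (int i = 0; i < runs; i++)
        {
            // Build the input outside the timed region.
            T input = setup();

            // Try to get pending garbage out of the way before timing.
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();

            sw.Restart();
            action(input);
            sw.Stop();
            results[i] = sw.Elapsed;
        }
        return results;
    }
}
```

Reporting each entry of the returned array separately (or at least the minimum and median) makes spikes from other processes visible instead of hiding them in an average.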

Finally, I'll note that your algorithm to remove matching items from an array and shift everything else down uses a nested loop, which is not necessary and will access items in the array after the matching index twice. The standard algorithm looks like this:

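    // Extension method: must be declared inside a static class.
    public static void RemoveFromArray(this Obj[] array, int value)
    {
        // 'to' is the next free slot for an item that is kept.
        int to = 0;
        for (int from = 0; from < array.Length; from++)
        {
            var fromItem = array[from];
            // Copy every non-matching item down; matching items are skipped.
            if (fromItem.Value != value)
                array[to++] = fromItem;
        }
        // Clear the slots left over at the end of the array.
        for (; to < array.Length; to++)
        {
            array[to] = default(Obj);
        }
    }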

However, instead of using the standard algorithm you might experiment with using Array.Copy() to shift the tail in your version, since (I believe) internally it does the copy in unmanaged code. (Arrays are fixed-size, so they have no usable RemoveAt.)
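A minimal sketch of that built-in route, assuming Array.Copy is used to shift the tail left by one slot (it is documented to handle overlapping source and destination ranges within the same array). The `Obj` class here is a stand-in, since the question does not show its definition, and `ArrayCopySketch.RemoveFirst` is a hypothetical name. Like the question's version, and unlike the loop above, it removes only the first match.

```csharp
using System;

// Stand-in for the question's Obj type (assumed: an int Value and an int constructor).
class Obj
{
    public int Value { get; private set; }
    public Obj(int value) { Value = value; }
}

static class ArrayCopySketch
{
    // Removes the first item whose Value equals `value`, shifts the rest of
    // the array left by one, and nulls the last slot. Returns true if an
    // item was removed.
    public static bool RemoveFirst(Obj[] list, int value)
    {
        for (int i = 0; i < list.Length; i++)
        {
            // The null check lets this run on partly filled arrays.
            if (list[i] != null && list[i].Value == value)
            {
                int tail = list.Length - i - 1;
                if (tail > 0)
                {
                    // Source and destination overlap; Array.Copy copes with that.
                    Array.Copy(list, i + 1, list, i, tail);
                }
                list[list.Length - 1] = null;
                return true;
            }
        }
        return false;
    }
}
```

Whether this actually beats the hand-written inner loop is something to measure rather than assume.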

solution1
2 ACCEPTED 2014-09-03 14:28:46
