
Efficient insertion/deletion algorithm for an array

I subscribe to a data feed and from that create and maintain a structure using the index values on the INSERT/DELETE messages. I would like to ask the assembled cognoscenti whether they know of any algorithm which can deal with the piecemeal updates in an efficient way - generally batch-updates contain between two and six such messages.

The estimated size of the array is around 1000 elements.

Batch updates arrive as a list of messages ordered by index, which stipulate the insertion or deletion of an item at a given index. I expect most of the churn in the array to be nearer its start than its end.

It occurs to me that with some basic processing I can determine the range affected by the batch and the overall size-delta, and therefore move an unaffected tail section of the array just once.

Similarly, I could maintain a certain amount of free space before the first element and after the final element to do the least amount of copying possible.
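The tail-move idea could be sketched like this. It assumes the batch has already been analysed so that `tailStart` (the first input index past the affected range) and `netDelta` (inserts minus deletes) are known; all names here are illustrative, not from the feed's API:

```java
import java.util.Arrays;

class TailMove {
    // One-shot tail move: everything at or past tailStart (in input
    // coordinates) is unaffected by the batch, so shift it by netDelta in a
    // single copy instead of once per message. The caller then rewrites only
    // the affected middle region. size is the logical element count.
    static Object[] shiftTailOnce(Object[] items, int size, int tailStart, int netDelta) {
        int newSize = size + netDelta;
        // Grow only if the slack capacity is insufficient.
        Object[] dest = (newSize <= items.length) ? items : Arrays.copyOf(items, newSize);
        // System.arraycopy copes correctly with overlapping ranges in the
        // same array, so an in-place shift is safe in either direction.
        System.arraycopy(items, tailStart, dest, tailStart + netDelta, size - tailStart);
        return dest;
    }
}
```

The same trick combines naturally with the slack-space idea: with spare capacity at both ends, a net-growth batch can often shift the (shorter of) head or tail without reallocating.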

Other optimisations include recognising updates such as the following:

DELETE 10, INSERT 10 - effectively a replace which requires no copying  
INSERT 10, DELETE 11 - as above  
DELETE 10, DELETE 10, DELETE 10 - bulk deletion can be optimised into one copy operation  
INSERT 11, INSERT 12, INSERT 13 - bulk insertion can be optimised into one copy operation  

and so on.
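Since the batch is already ordered by index, the bulk cases can be detected in a single forward pass with no look-ahead beyond the previous run. A sketch, using a hypothetical `Msg`/`Run` shape rather than the feed's real message type (the DELETE-then-INSERT "replace" case is omitted for brevity):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative message shapes, not the feed's actual types.
record Msg(char type, int index) {}            // type: 'I' (insert) or 'D' (delete)
record Run(char type, int index, int count) {} // a coalesced range of messages

class Coalescer {
    // Collapses an index-ordered batch into runs: repeated DELETEs at the
    // same index become one run (elements shift down after each delete), and
    // INSERTs at consecutive indices become one run. Each run can then be
    // applied with a single copy operation.
    static List<Run> coalesce(List<Msg> batch) {
        List<Run> runs = new ArrayList<>();
        for (Msg m : batch) {
            Run last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
            boolean extendsDelete = last != null && last.type() == 'D'
                    && m.type() == 'D' && m.index() == last.index();
            boolean extendsInsert = last != null && last.type() == 'I'
                    && m.type() == 'I' && m.index() == last.index() + last.count();
            if (extendsDelete || extendsInsert) {
                runs.set(runs.size() - 1, new Run(last.type(), last.index(), last.count() + 1));
            } else {
                runs.add(new Run(m.type(), m.index(), 1));
            }
        }
        return runs;
    }
}
```

Because the pass is a single comparison against the last run per message, the recognition overhead should stay well below the cost of an extra array copy.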

However, I'm wary of the overhead in performing the recognition step - it smacks of look-aheads and track-backs, which may well take more time than simply performing the copy.

Given the expected size of the array, tree structures seem heavyweight: some basic performance testing suggests that binary or self-balancing trees (in this case a red-black tree list-implementation) only begin to show performance advantages after around 15K - 20K elements: array-copies are significantly faster at lower sizes. I should probably add that I'm using Java for this implementation.

Any hints, tips or suggestions would be welcomed.

Cheers

Mike

Always weigh code clarity vs. optimization. If there is no performance issue now, just make sure that the code is clear. If there is a performance issue in the future, you'll know its exact nature then. Preparing for it now is an exercise in guesswork.

If you need to manipulate the list quite a bit, a linked list may be worth considering.

For simple, clear code, however, I would use Apache Commons Lang's ArrayUtils for a raw array, or an ArrayList otherwise:

myArray = ArrayUtils.add(myArray, insertionIndex, newItem);

OR

ArrayList<MyItem> myList = new ArrayList<>(Arrays.asList(myArray));
myList.add(insertionIndex, newItem);

In general, if you have the changes listed by index order, you can build a simple loop that only copies things once. Here's some pseudocode:

array items;
array changes; // each entry holds an index, a type (INSERT/DELETE) and optional data
array out;     // empty, possibly with ensureCapacity(items.length)
int c = 0, delta = 0;
// c is the current change
// delta tracks how indexing has shifted due to previous operations
for (i = 0; i < items.length; i++) {
    if (c < changes.length) {
        curchange = changes[c]
        if ((i + delta) == curchange.index) {
            c++;
            if (curchange.type == INSERT) {
                out.add(curchange.data)
                delta++;
            } else {
                delta--;
                continue; // skip copying items[i]
            }
        }
    }
    out.add(items[i])
}
for (; c < changes.length; c++) { // handle trailing inserts
    curchange = changes[c]
    assert(curchange.index == out.length && curchange.type == INSERT)
    out.add(curchange.data);
}

That runs through the input array once, and builds the output array with all changes made.

Note that this doesn't handle multiple inserts at the same location. It would make the code a bit more elaborate to do that but it's not too hard.

However, it will always run all the way through the array, once per batch. A slightly tougher change would be to keep a temporary around and do the changes in-place with two index variables; then, if you hit the end of the change list, you could break out of the loop early and not touch the rest of the list.
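That early-exit variant could look like this in Java; the `Change` record is illustrative, and once the change list is exhausted the untouched remainder is moved with a single bulk copy rather than element by element:

```java
import java.util.List;

// Illustrative message shape: insert=true for INSERT, false for DELETE.
record Change(boolean insert, int index, Object data) {}

class OnePass {
    // Single pass over the input, as in the pseudocode above, but the loop
    // exits as soon as all changes are consumed and the rest is bulk-copied.
    // Like the pseudocode, it doesn't handle multiple changes hitting the
    // same position within one batch.
    static Object[] apply(Object[] items, List<Change> changes) {
        int netDelta = 0;
        for (Change ch : changes) netDelta += ch.insert() ? 1 : -1;
        Object[] out = new Object[items.length + netDelta];
        int c = 0, delta = 0, w = 0; // w = write position in out
        int i = 0;
        for (; i < items.length; i++) {
            if (c == changes.size()) break;      // early exit
            Change cur = changes.get(c);
            if (i + delta == cur.index()) {
                c++;
                if (cur.insert()) {
                    out[w++] = cur.data();
                    delta++;
                } else {
                    delta--;
                    continue;                    // skip items[i]
                }
            }
            out[w++] = items[i];
        }
        int rest = items.length - i;             // untouched tail, one copy
        System.arraycopy(items, i, out, w, rest);
        w += rest;
        for (; c < changes.size(); c++) {        // trailing inserts
            out[w++] = changes.get(c).data();
        }
        return out;
    }
}
```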

The simplest would be to run through the updates and copy the array into a new array while applying the updates.

1000 is not that big, it is probably not worthwhile to optimize further.

And to make your life easier, use an ArrayList.

Beyond sorting the individual updates (as you mentioned already) to try and consolidate things, I don't know that I'd bother much. 1000 elements is, frankly, nothing in the large scope of things. I have a system with 25M elements, using simple bulk copies, and it's (for our purposes) far more than fast enough.

So, I won't put the "premature optimization" hat on, but I might glance over at it on the bookshelf first.

Using a linked list ( java.util.LinkedList ) might be something to look into. Getting an element at a particular index is expensive of course, but it may be better than performing array copies.
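For an index-ordered batch, the cost of `get(index)` can be avoided entirely by walking the list once with a ListIterator and applying every message during that single traversal. A sketch, with an illustrative `Update` record; it assumes the batch's indices are non-decreasing in current-list coordinates (an insert followed by another insert at the same index would need iterator backtracking):

```java
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;

// Illustrative message shape: insert=true for INSERT, false for DELETE.
record Update(boolean insert, int index, Object data) {}

class LinkedListBatch {
    // Applies an index-ordered batch in one traversal: the iterator only ever
    // moves forward, so the whole batch costs O(n + k) instead of O(n * k).
    static void apply(LinkedList<Object> list, List<Update> batch) {
        ListIterator<Object> it = list.listIterator();
        for (Update u : batch) {
            while (it.nextIndex() < u.index()) it.next();
            if (u.insert()) {
                it.add(u.data());  // inserts just before the cursor
            } else {
                it.next();         // step onto the element at u.index()
                it.remove();
            }
        }
    }
}
```

Since inserts and removals via the iterator adjust positions as they go, the message indices line up naturally with the list's evolving state.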

There is an extremely simple-to-implement data structure, named the "Cartesian tree" or "treap", that allows O(log N) splits, concatenations, insertions and deletions on arrays (and much more).
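A minimal sketch of the idea, as an implicit-key treap (positions are derived from subtree sizes, so no explicit keys are stored); this is illustrative, not production code:

```java
import java.util.Random;

// Implicit-key treap: a list supporting insert/delete at an index in
// expected O(log n) time via split and merge on subtree sizes.
class Treap {
    static final Random RNG = new Random();

    static class Node {
        Object value;
        int priority = RNG.nextInt(); // random heap priority keeps it balanced
        int size = 1;                 // subtree size, used as the implicit key
        Node left, right;
        Node(Object v) { value = v; }
    }

    static int size(Node n) { return n == null ? 0 : n.size; }
    static Node update(Node n) { n.size = 1 + size(n.left) + size(n.right); return n; }

    // Splits t into two treaps: the first k elements and the rest.
    static Node[] split(Node t, int k) {
        if (t == null) return new Node[] { null, null };
        if (size(t.left) < k) {
            Node[] r = split(t.right, k - size(t.left) - 1);
            t.right = r[0];
            return new Node[] { update(t), r[1] };
        } else {
            Node[] l = split(t.left, k);
            t.left = l[1];
            return new Node[] { l[0], update(t) };
        }
    }

    // Concatenates two treaps, preserving order; min-priority node wins root.
    static Node merge(Node a, Node b) {
        if (a == null) return b;
        if (b == null) return a;
        if (a.priority < b.priority) {
            a.right = merge(a.right, b);
            return update(a);
        } else {
            b.left = merge(a, b.left);
            return update(b);
        }
    }

    static Node insert(Node t, int index, Object v) {
        Node[] parts = split(t, index);
        return merge(merge(parts[0], new Node(v)), parts[1]);
    }

    static Node delete(Node t, int index) {
        Node[] a = split(t, index);
        Node[] b = split(a[1], 1); // b[0] is the removed element
        return merge(a[0], b[1]);
    }

    static Object get(Node t, int index) {
        int ls = size(t.left);
        if (index < ls) return get(t.left, index);
        if (index == ls) return t.value;
        return get(t.right, index - ls - 1);
    }
}
```

That said, given the OP's own measurements showing trees only pay off past 15K-20K elements, this is more of a fallback if the array ever grows.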

2-3 trees are also very simple to implement (my implementation of a somewhat more complex facility had just one bug after the first compilation) and would fit your purpose.

If space is not a constraint and you are not going to have duplicates, go for a Set data structure, in particular Java's HashSet. The power of this data structure is that insertion and deletion are done in O(1) time, which would best suit you if performance is 'the' criterion.

Moreover, whenever you speak of arrays, besides their fast retrieval you have the serious limitation of the numerous array copies that may happen, which not only take up space (for array growth) but are also inefficient, since each insert/delete may take O(n) time.

If that's truly what your data set looks like, you might consider duplicate tracking with a Collection (like a HashMap). The Array would be your ordered list with each activity listed in order, and your Collection would be indices to the array.

For example:

class EventQueue
{
  Vector<Event> eventQueue = new Vector<Event>();
  HashMap<Object, Integer> eventMap = new HashMap<Object, Integer>();

  public synchronized Event getNextEvent()
  {
      Event event = eventQueue.remove(0);
      eventMap.remove(event.getId());  // this would be 10 from 'INSERT 10'
                                       // in the sample from the OP
      // note: indices cached in eventMap for the remaining events are now
      // stale; a real implementation would need to compensate for the shift
      return event;
  }

  public synchronized void addEvent(Event e)
  {
     if( eventMap.containsKey(e.getId()) )
     {
        // replace events that already exist
        int idx = eventMap.get(e.getId());
        eventQueue.removeElementAt(idx);
        eventQueue.add(idx, e);
     } else {
        // add new events; record the index the event will occupy
        eventMap.put(e.getId(), eventQueue.size());
        eventQueue.add(e);
     }
  }

  public boolean isReady()
  {
    return eventQueue.size() > 0;
  }
}

class FeedListener extends Thread {
 EventQueue queue;
 EventFeed feed;
 ...
 public void run()
 {
    while(running) {
       try { sleep(sleepTime); } catch (InterruptedException e) { return; }
       if( feed.isEventReady() ) {
         queue.addEvent(feed.getEvent());
       }
    }
 }
}

abstract class EventHandler extends Thread {
 EventQueue queue;
 ...
 public void run()
 {
   while(running)
   {
     try { sleep(sleepTime); } catch (InterruptedException e) { return; }
     if( queue.isReady() )
     {
       Event event = queue.getNextEvent();
       handleEvent(event);
     }
   }
 }

 public abstract void handleEvent(Event event);
}
