What is an O(1)-search memory-efficient data structure to store pairs of integers?

Consider this interface:

public interface CoordinateSet {
    boolean contains(int x, int y);
    default boolean contains(Coordinate coord) {
        return contains(coord.x, coord.y);
    }
}

It represents a set of 2-dimensional integer coordinates, and each possible coordinate may be either inside the set (contains returns true) or outside (contains returns false).

There are many ways we can implement such an interface. The most computationally efficient one would be an implementation backed by an array:

public class ArrayCoordinateSet implements CoordinateSet {
    private final boolean[][] coords = new boolean[SIZE][SIZE];
    // ...
    @Override
    public boolean contains(int x, int y) {
        return coords[x][y];
    }
    public void add(int x, int y) {
        coords[x][y] = true;
    }
    // ...

}

However, if SIZE is large, say 1000, and only 4 coordinates belong to the set, one in each corner of a 1000×1000 square, then the vast majority of the space is consumed by false values. For such a sparse CoordinateSet we'd be better off using a HashSet-based CoordinateSet:

public final class Coordinate {
    public final int x;
    public final int y;
    public Coordinate(int x, int y) {
        this.x = x;
        this.y = y;
    }
    // .equals() and hashCode()
}
public class HashBasedCoordinateSet implements CoordinateSet {
    private final Set<Coordinate> coords = new HashSet<>();
    @Override
    public boolean contains(int x, int y) {
        return coords.contains(new Coordinate(x, y));
    }
    @Override
    public boolean contains(Coordinate coord) {
        return coords.contains(coord);
    }
    public void add(Coordinate coord) {
        coords.add(coord);
    }
}

However, the HashBasedCoordinateSet has the following issue:

for (int x=0; x<1000; x++) {
  for (int y=0; y<1000; y++) {
    hashBasedCoordinateSet.contains(x, y);
  }
}

When we have values x and y and want to check whether hashBasedCoordinateSet.contains(x, y), each method call has to create a new object (we always need an object to search a HashSet; having the object's data is not enough). That would be a real waste of CPU time: all those Coordinate objects have to be created and then garbage-collected, since seemingly no escape-analysis optimisation can be performed on this code.

So finally, my question is:

What would be the data structure to store a sparse set of coordinates that:

  1. Has O(1) contains(int x, int y) operation;
  2. Efficiently uses space (unlike the array-based implementation);
  3. Does not have to create extra objects during contains(int x, int y)?

A long is twice the size of an integer in Java, so one can store two ints in one long. So how about this?

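import java.util.HashSet;

public class CoordinateSet {
    private final HashSet<Long> coordinates = new HashSet<>();

    // Pack y into the high 32 bits and x into the low 32 bits.
    // The mask is required: without it a negative x is sign-extended
    // to 64 bits and overwrites the bits that hold y.
    private static long pack(int x, int y) {
        return ((long) y << 32) | (x & 0xFFFFFFFFL);
    }

    public void add(int x, int y) {
        coordinates.add(pack(x, y));
    }

    public boolean contains(int x, int y) {
        return coordinates.contains(pack(x, y));
    }
}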

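The long in the contains method is a primitive local, so it lives on the stack. Note, however, that a HashSet<Long> works with boxed values: the argument is autoboxed to a Long on each call (only values from -128 to 127 are guaranteed to come from the cache), so an allocation can still occur unless the JIT manages to remove it.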
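To avoid boxing altogether, the same packing idea can be combined with a hash table that stores primitive longs directly. The following is only a minimal sketch under that assumption: the class name PackedCoordinateSet, the multiplier used for bit mixing, and the 0.5 load factor are illustrative choices, not taken from any library or from the answer above.

```java
/**
 * Sketch of a set of (x, y) pairs stored as packed primitive longs in an
 * open-addressing hash table. contains() allocates no objects.
 * All names are hypothetical.
 */
final class PackedCoordinateSet {
    private long[] keys = new long[16];        // capacity is always a power of two
    private boolean[] used = new boolean[16];  // marks occupied slots
    private int size;

    /** Packs two ints into one long; the mask keeps a negative x out of y's bits. */
    static long pack(int x, int y) {
        return ((long) y << 32) | (x & 0xFFFFFFFFL);
    }

    /** Mixes the key bits so that neighbouring coordinates spread over the table. */
    private static int slot(long key, int mask) {
        long h = key * 0x9E3779B97F4A7C15L;
        return (int) (h >>> 32) & mask;
    }

    public boolean contains(int x, int y) {
        long key = pack(x, y);
        int mask = keys.length - 1;
        int i = slot(key, mask);
        while (used[i]) {                      // linear probing until an empty slot
            if (keys[i] == key) return true;
            i = (i + 1) & mask;
        }
        return false;
    }

    /** Returns true if the coordinate was not already present. */
    public boolean add(int x, int y) {
        if ((size + 1) * 2 > keys.length) resize();  // keep the table at most half full
        return insert(pack(x, y));
    }

    private boolean insert(long key) {
        int mask = keys.length - 1;
        int i = slot(key, mask);
        while (used[i]) {
            if (keys[i] == key) return false;
            i = (i + 1) & mask;
        }
        keys[i] = key;
        used[i] = true;
        size++;
        return true;
    }

    private void resize() {
        long[] oldKeys = keys;
        boolean[] oldUsed = used;
        keys = new long[oldKeys.length * 2];
        used = new boolean[oldKeys.length * 2];
        size = 0;
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldUsed[j]) insert(oldKeys[j]);
        }
    }

    public int size() { return size; }
}
```

Lookups are O(1) on average rather than in the worst case, and memory grows with the number of stored coordinates instead of with the coordinate range.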

Optimizing without measuring is of course always dangerous. You probably should profile your app to see if that is really a bottleneck.

You also present two use cases:

  1. Find a single coordinate in a set
  2. Find all coordinates that are part of the set within a given bound

Use case 2 could be much more efficient if you walk the set's iterator and filter out the coordinates that fall outside the bound. This may return the data in arbitrary order, and the performance depends heavily on how large the dataset is.
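As a rough illustration of the second use case, the sketch below walks a set once and keeps only the members inside a rectangle. It assumes coordinates are stored as packed longs (y in the high 32 bits, x in the low 32 bits); the class BoundedScan and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: scans the stored coordinates instead of probing every cell. */
final class BoundedScan {
    static long pack(int x, int y) { return ((long) y << 32) | (x & 0xFFFFFFFFL); }
    static int unpackX(long key) { return (int) key; }          // low 32 bits
    static int unpackY(long key) { return (int) (key >> 32); }  // high 32 bits

    /**
     * Returns the packed members with minX <= x < maxX and minY <= y < maxY.
     * The result order follows the set's iteration order, which is arbitrary
     * for a HashSet.
     */
    static List<Long> within(Set<Long> coords, int minX, int maxX, int minY, int maxY) {
        List<Long> hits = new ArrayList<>();
        for (long key : coords) {
            int x = unpackX(key);
            int y = unpackY(key);
            if (x >= minX && x < maxX && y >= minY && y < maxY) {
                hits.add(key);
            }
        }
        return hits;
    }
}
```

For the four-corner example from the question, this loop runs four times, whereas the nested loop over a 1000×1000 area performs 1,000,000 contains calls. The trade-off reverses when the set is much larger than the area being queried.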

Maybe a simple Table data structure, like the one provided by Guava, could give you a much nicer interface - indexing the X and Y coordinates as ints - while at the same time giving you O(1) access.

Table<Integer, Integer, Coordinate> index = HashBasedTable.create();

Another suggestion is to look into locality-sensitive hashing. You basically create a new hash function that maps your XY coordinates into a common one-dimensional space that is easy to query. But this might be beyond the scope of the question.

If you want to have an O(1) data structure, you need a lookup mechanism that is independent of the actual values you want to store in the data structure. The only way to do this is to enumerate your values, derive a formula that calculates the enumeration value of the pair you have, and then keep an array of yes/no values, one for each enumeration value.

For instance, if you have that x is guaranteed to be between 0 and 79 and y is guaranteed to be between 0 and 24, you can use the enumeration formula y*80+x, which for the pair (10,10) would be 810. Then look up in the very large array of yes/no values if the value stored for 810 is a yes.
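The enumeration formula can be sketched with java.util.BitSet, which keeps roughly one bit per cell instead of one boolean per cell. The class name DenseCoordinateSet and the explicit bounds check are assumptions added for illustration.

```java
import java.util.BitSet;

/** Hypothetical dense set for 0 <= x < width and 0 <= y < height. */
final class DenseCoordinateSet {
    private final int width;
    private final int height;
    private final BitSet bits;

    DenseCoordinateSet(int width, int height) {
        this.width = width;
        this.height = height;
        this.bits = new BitSet(width * height);
    }

    /** Enumeration formula from the text: y * width + x. */
    int index(int x, int y) {
        if (x < 0 || x >= width || y < 0 || y >= height) {
            throw new IndexOutOfBoundsException("(" + x + ", " + y + ")");
        }
        return y * width + x;
    }

    void add(int x, int y) {
        bits.set(index(x, y));
    }

    boolean contains(int x, int y) {
        return bits.get(index(x, y));
    }
}
```

With width 80 and height 25, index(10, 10) is 10 * 80 + 10 = 810, matching the example above. The space is still proportional to width × height, so this only reduces the constant factor of the array-based approach.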

So, if you insist on having an O(1) algorithm, you need the space to hold the yes/no values.

You could try a binary tree, using the bits that make up the values of x and y as the key. For example, if x and y are 32-bit integers, the total depth of the tree is 64. So you loop through the bits of x and y, making at most 64 decisions to arrive at a contains/not-contains answer.

Update in response to comments: Granted, trees aren't what you normally think of if you want O(1), but keep in mind the array-based approach in the original question is only O(1) up to an implementation limit on available memory. All I'm doing is assuming the bit length of an integer is a fixed implementation constraint, which is generally a safe assumption. Put another way, if you really want the contains() call to run in constant time, you could code it to always do 64 comparison operations and then return.

Admittedly, a CS professor probably wouldn't buy that argument. Ever since we got rid of the homework tag I've had trouble knowing whether someone wants a real-world answer or a theoretical CS answer.
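A minimal sketch of the bit-keyed binary tree described in this answer might look as follows. It assumes the 64-bit key is formed from y in the high 32 bits and x in the low 32 bits; the class and field names are hypothetical.

```java
/** Hypothetical binary trie keyed on the 64 bits of a packed (x, y) pair. */
final class BitTrieCoordinateSet {
    private static final class Node {
        Node zero;  // child followed when the current bit is 0
        Node one;   // child followed when the current bit is 1
    }

    private final Node root = new Node();

    private static long pack(int x, int y) {
        return ((long) y << 32) | (x & 0xFFFFFFFFL);
    }

    void add(int x, int y) {
        long key = pack(x, y);
        Node node = root;
        // Walk from the most significant bit down, creating nodes as needed.
        for (int bit = 63; bit >= 0; bit--) {
            if (((key >>> bit) & 1L) == 0) {
                if (node.zero == null) node.zero = new Node();
                node = node.zero;
            } else {
                if (node.one == null) node.one = new Node();
                node = node.one;
            }
        }
    }

    boolean contains(int x, int y) {
        long key = pack(x, y);
        Node node = root;
        // At most 64 steps; stops early once a branch is missing.
        for (int bit = 63; bit >= 0 && node != null; bit--) {
            node = ((key >>> bit) & 1L) == 0 ? node.zero : node.one;
        }
        return node != null;  // a full 64-level path exists only for stored keys
    }
}
```

contains allocates nothing and takes a bounded number of steps, but add may create up to 64 nodes per coordinate, so this variant trades memory for that bound.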
