
What is an O(1)-search memory-efficient data structure to store pairs of integers?

Consider this interface:

public interface CoordinateSet {
    boolean contains(int x, int y);
    default boolean contains(Coordinate coord) {
        return contains(coord.x, coord.y);
    }
}

It represents a set of 2-dimensional integer coordinates, and each possible coordinate may be either inside the set (contains returns true) or outside it (contains returns false).

There are many ways we can implement such an interface. The most computationally efficient one would be an implementation backed by an array:

public class ArrayCoordinateSet implements CoordinateSet {
    private final boolean[][] coords = new boolean[SIZE][SIZE];
    // ...
    @Override
    public boolean contains(int x, int y) {
        return coords[x][y];
    }
    public void add(int x,  int y) {
        coords[x][y] = true;
    }
    // ...

}

However, if SIZE is large, say 1000, and only 4 coordinates belong to the set, one in each corner of the 1000×1000 square, then the vast majority of the array's space is consumed by false values. For such a sparse CoordinateSet we'd be better off using a HashSet-based CoordinateSet:

public final class Coordinate {
    public final int x;
    public final int y;
    public Coordinate(int x, int y) {
        this.x = x;
        this.y = y;
    }
    // .equals() and hashCode()
}
public class HashBasedCoordinateSet implements CoordinateSet {
    private final Set<Coordinate> coords = new HashSet<>();
    @Override
    public boolean contains(int x, int y) {
        return coords.contains(new Coordinate(x, y));
    }
    @Override
    public boolean contains(Coordinate coord) {
         return coords.contains(coord);
    }
    public void add(Coordinate coord) {
        coords.add(coord);
    }
}

However, the HashBasedCoordinateSet has the following issue:

for (int x=0; x<1000; x++) {
  for (int y=0; y<1000; y++) {
    hashBasedCoordinateSet.contains(x, y);
  }
}

When we have values x and y and want to check hashBasedCoordinateSet.contains(x, y), every method call has to create a new object (we always need an object to search in a HashSet; having just the object's data is not enough). That would be a real waste of CPU time: all those Coordinate objects have to be created and then garbage-collected, since seemingly no escape-analysis optimisation can be performed on this code.

So finally, my question is:

What would be the data structure to store a sparse set of coordinates that:

  1. Has an O(1) contains(int x, int y) operation;
  2. Uses space efficiently (unlike the array-based implementation);
  3. Does not have to create extra objects during contains(int x, int y)?

A long is twice the size of an int in Java, so one can store two ints in one long. So how about this?

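import java.util.HashSet;
import java.util.Set;

public class CoordinateSet {
    private final Set<Long> coordinates = new HashSet<>();

    // Pack y into the high 32 bits and x into the low 32 bits.
    // The mask stops a negative x from sign-extending over y's bits.
    private static long pack(int x, int y) {
        return ((long) y << 32) | (x & 0xFFFFFFFFL);
    }

    public void add(int x, int y) {
        coordinates.add(pack(x, y));
    }

    public boolean contains(int x, int y) {
        return coordinates.contains(pack(x, y));
    }
}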

I am pretty sure the long in the contains method is stored on the stack.
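One caveat: a HashSet<Long> autoboxes its argument, so a lookup may still allocate a Long unless the JIT manages to eliminate it. A set that stores primitive long keys directly avoids this. The following is only a minimal sketch of that idea using open addressing with linear probing. The class name PackedCoordinateSet, the initial capacity of 16, the 0.5 load factor and the mixing constant are all assumptions made for illustration, and removal is not supported.

```java
/** Sketch: open-addressing hash set of packed (x, y) pairs, no allocation in contains(). */
final class PackedCoordinateSet {
    private long[] keys = new long[16];        // capacity is always a power of two
    private boolean[] used = new boolean[16];  // marks which slots are occupied
    private int size;

    // y goes into the high 32 bits, x into the low 32 bits (masked to avoid sign extension).
    private static long pack(int x, int y) {
        return ((long) y << 32) | (x & 0xFFFFFFFFL);
    }

    // Multiplicative mixing, then take the high half and reduce it to a slot index.
    private static int slot(long key, int mask) {
        long h = key * 0x9E3779B97F4A7C15L;
        return ((int) (h >>> 32)) & mask;
    }

    public boolean contains(int x, int y) {
        long key = pack(x, y);
        int mask = keys.length - 1;
        // Linear probing: stop at the first empty slot.
        for (int i = slot(key, mask); used[i]; i = (i + 1) & mask) {
            if (keys[i] == key) return true;
        }
        return false;
    }

    /** Returns true if the pair was not already present. */
    public boolean add(int x, int y) {
        if ((size + 1) * 2 > keys.length) grow(); // keep the table at most half full
        return insert(pack(x, y));
    }

    private boolean insert(long key) {
        int mask = keys.length - 1;
        int i = slot(key, mask);
        while (used[i]) {
            if (keys[i] == key) return false;
            i = (i + 1) & mask;
        }
        used[i] = true;
        keys[i] = key;
        size++;
        return true;
    }

    // Double the table and re-insert every stored key.
    private void grow() {
        long[] oldKeys = keys;
        boolean[] oldUsed = used;
        keys = new long[oldKeys.length * 2];
        used = new boolean[oldKeys.length * 2];
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldUsed[i]) insert(oldKeys[i]);
        }
    }

    public int size() { return size; }
}
```

Lookups are expected O(1) and memory grows with the number of stored pairs rather than with the coordinate range. Whether this actually beats HashSet<Long> in a given application is something to measure rather than assume.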

Optimizing without measuring is of course always dangerous. You should probably profile your app to see whether this really is a bottleneck.

You also present two use cases:

  1. Find a single coordinate in a set
  2. Find all coordinates that are part of the set within a given bound

Use case 2 could be made much more efficient by walking the set's iterator and filtering out the coordinates you don't want. This might return the data in arbitrary order, and the performance depends greatly on how large the dataset is.
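As a rough sketch of the iterate-and-filter idea: the code below assumes the members are stored as packed long keys (y in the high 32 bits, x in the low 32 bits), which is a representation borrowed for illustration and not something this answer prescribes. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class BoundedQuery {
    static long pack(int x, int y) { return ((long) y << 32) | (x & 0xFFFFFFFFL); }
    static int unpackX(long key) { return (int) key; }          // low 32 bits
    static int unpackY(long key) { return (int) (key >>> 32); } // high 32 bits

    /** Returns the members with minX <= x < maxX and minY <= y < maxY, in arbitrary order. */
    static List<Long> within(Set<Long> members, int minX, int maxX, int minY, int maxY) {
        List<Long> hits = new ArrayList<>();
        for (long key : members) {
            int x = unpackX(key);
            int y = unpackY(key);
            if (x >= minX && x < maxX && y >= minY && y < maxY) {
                hits.add(key);
            }
        }
        return hits;
    }
}
```

This visits each stored coordinate once, so for the sparse case in the question it does a handful of checks instead of the 1,000,000 contains calls made by the nested loop. For a dense set and a small bound, the nested loop may still be cheaper.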

Maybe a simple Table data structure, like the one provided by Guava, could give you a much nicer interface (indexing the X and Y coordinates as ints) while at the same time giving you O(1) access.

Table<Integer, Integer, Coordinate> index = HashBasedTable.create();

Another suggestion is to look into location-sensitive hashing. You basically create a new hash function that maps your XY coordinates into a common one-dimensional space that is easy to query. But this might be beyond the scope of the question.
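One concrete way to map XY coordinates into a one-dimensional space is a Z-order (Morton) code, which interleaves the bits of x and y. This is a different, named technique swapped in for illustration, and it is not necessarily what the suggestion above has in mind. The class name MortonCode is hypothetical.

```java
final class MortonCode {
    // Spreads the 32 bits of v so that they occupy the even bit positions of a long.
    static long spread(int v) {
        long x = v & 0xFFFFFFFFL;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;
        x = (x | (x << 8))  & 0x00FF00FF00FF00FFL;
        x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x << 2))  & 0x3333333333333333L;
        x = (x | (x << 1))  & 0x5555555555555555L;
        return x;
    }

    // Bits of x land on even positions, bits of y on odd positions.
    static long encode(int x, int y) {
        return spread(x) | (spread(y) << 1);
    }
}
```

For example, encode(3, 5) interleaves x = 011 and y = 101 into 100111, which is 39. Points that are close in the plane tend to get nearby codes, so the codes can be kept in a sorted structure for range-style queries, at the price of giving up strict O(1) lookup.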

If you want to have an O(1) data structure, you need a lookup mechanism that is independent of the actual values you want to store in the data structure. The only way to do this is to enumerate your values, derive a formula that calculates the enumeration value of the pair you have, and then keep an array with a yes/no value for each enumeration value.

For instance, if x is guaranteed to be between 0 and 79 and y is guaranteed to be between 0 and 24, you can use the enumeration formula y*80+x, which for the pair (10,10) gives 810. Then look up in the very large array of yes/no values whether the value stored for 810 is a yes.

So, if you insist on having an O(1) algorithm, you need the space to hold the yes/no values.
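The enumeration approach can be sketched with java.util.BitSet, which spends one bit per yes/no value instead of the one byte per cell that a boolean array typically uses. The class name DenseCoordinateSet and the choice to treat out-of-range lookups as "not contained" are assumptions made for this example.

```java
import java.util.BitSet;

final class DenseCoordinateSet {
    private final int width;
    private final int height;
    private final BitSet bits;

    DenseCoordinateSet(int width, int height) {
        this.width = width;
        this.height = height;
        this.bits = new BitSet(width * height);
    }

    private boolean inBounds(int x, int y) {
        return x >= 0 && x < width && y >= 0 && y < height;
    }

    // Enumeration formula from the text: y * width + x.
    int index(int x, int y) {
        return y * width + x;
    }

    void add(int x, int y) {
        if (!inBounds(x, y)) {
            throw new IndexOutOfBoundsException("(" + x + ", " + y + ")");
        }
        bits.set(index(x, y));
    }

    boolean contains(int x, int y) {
        return inBounds(x, y) && bits.get(index(x, y));
    }
}
```

With width 80 and height 25, the pair (10,10) maps to index 810, matching the arithmetic above. For the 1000×1000 case in the question this needs about 125 KB regardless of how many coordinates are stored.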

You could try a binary tree, using the bits that make up the values of x and y as the key. For example, if x and y are 32-bit integers, the total depth of the tree is 64. So you loop through the bits of x and y, making at most 64 decisions to arrive at a contains/not-contains answer.
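A minimal sketch of such a bit-keyed binary tree (a binary trie) follows. It assumes the 64-bit key is formed by putting y in the high 32 bits and x in the low 32 bits; the class and field names are hypothetical.

```java
final class BitTrieCoordinateSet {
    private static final class Node {
        Node zero; // child followed when the current key bit is 0
        Node one;  // child followed when the current key bit is 1
    }

    private final Node root = new Node();

    private static long pack(int x, int y) {
        return ((long) y << 32) | (x & 0xFFFFFFFFL);
    }

    void add(int x, int y) {
        long key = pack(x, y);
        Node n = root;
        // Walk from the most significant bit down, creating nodes as needed.
        for (int bit = 63; bit >= 0; bit--) {
            if (((key >>> bit) & 1L) == 0) {
                if (n.zero == null) n.zero = new Node();
                n = n.zero;
            } else {
                if (n.one == null) n.one = new Node();
                n = n.one;
            }
        }
    }

    boolean contains(int x, int y) {
        long key = pack(x, y);
        Node n = root;
        // At most 64 decisions; stop early when a branch is missing.
        for (int bit = 63; bit >= 0 && n != null; bit--) {
            n = ((key >>> bit) & 1L) == 0 ? n.zero : n.one;
        }
        return n != null;
    }
}
```

contains allocates nothing, but add may create up to 64 nodes per coordinate, so this plain form is not especially memory-efficient; compressing single-child chains would be the usual remedy.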

Update in response to comments: Granted, trees aren't what you normally think of if you want O(1), but keep in mind that the array-based approach in the original question is only O(1) up to an implementation limit on available memory. All I'm doing is assuming the bit length of an integer is a fixed implementation constraint, which is generally a safe assumption. Put another way, if you really want the contains() call to run in constant time, you could code it to always do 64 comparison operations and then return.

Admittedly, a CS professor probably wouldn't buy that argument. Ever since we got rid of the homework tag, I've had trouble knowing whether someone wants a real-world answer or a theoretical CS answer.


 