Implementing an efficient data structure using arrays only

As part of my programming course I was given an exercise to implement my own String collection. I was planning to use ArrayList or a similar collection, but one of the constraints is that we are not allowed to use any Java API to implement it, so only arrays are allowed. I could have implemented this with plain arrays, but efficiency is very important, as is the amount of data this code will be tested with. It was suggested that I use hash tables or ordered trees, as they are more efficient than arrays. After doing some research I decided to go with hash tables because they seemed easy to understand and implement, but once I started writing code I realised it is not as straightforward as I thought.

So here are the problems I have run into. I would like some advice on the best approach to solving them, again with efficiency in mind:

  1. ACTUAL SIZE: If I understood it correctly, hash tables are not ordered (indexed), which means there are going to be gaps between items because the hash function gives scattered indices. So how do I know when the array is full and I need to resize it?
  2. RESIZE: One of the difficulties is that I need to create a dynamic data structure using arrays. If I have an array String[100], once it gets full I will need to resize it by some factor. I decided to increase it by 100 each time. Once I do that, I will need to change the positions of all existing values, since their hash keys will be different because the key is calculated as:
int position = "orange".hashCode() % currentArraySize;

So if I try to find a certain value, its hash key will be different from what it was when the array was smaller.

  3. HASH FUNCTION: I was also wondering whether the built-in hashCode() method in the String class is efficient and suitable for what I am trying to implement, or whether it is better to create my own.
  4. DEALING WITH MULTIPLE OCCURRENCES: One of the requirements is to be able to add the same word multiple times, because I need to be able to count how many times a word is stored in my collection. Since the copies are going to have the same hash code, I was planning to add each further occurrence at the next index, hoping that there will be a gap. I don't know if it is the best solution, but here is how I implemented it:
public int count(String word) {
    int count = 0;
    while (collection[(word.hashCode() % size) + count] != null && collection[(word.hashCode() % size) + count].equals(word))
        count++;
    return count;
}

Thank you in advance for your advice. Please ask if anything needs to be clarified.

PS: The length of the words is not fixed and varies greatly.

UPDATE: Thank you for your advice. I know I made a few stupid mistakes there, and I will try to do better. I took all your suggestions and quickly came up with the following structure. It is not elegant, but I hope it is roughly what you meant. I did have to make a few judgement calls, such as the bucket size. For now I halve the size of elements, but is there a way to calculate it, or some general value? Another uncertainty was by what factor to grow my array: should I multiply by some number n, or is adding a fixed number also acceptable? I was also wondering about general efficiency, because I am actually creating instances of classes, but String is a class too, so I am guessing the difference in performance should not be too big?

ACTUAL SIZE: The built-in Java HashMap simply resizes when the total number of elements exceeds the number of buckets multiplied by a number called the load factor, which is 0.75 by default. It does not take into account how many buckets are actually full. You don't have to, either.

RESIZE: Yes, you'll have to rehash everything when the table is resized, which includes recomputing each element's index from its hash code.
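To make the two points above concrete, here is a minimal sketch of a table that tracks its element count, grows once the count would exceed capacity times the load factor, and re-inserts every stored string into the larger array. The class name, the method names, the initial capacity of 16, the doubling growth and the use of linear probing are all assumptions made for illustration; they are not taken from the question or from HashMap's internals. Counting of duplicates is left out to keep the resize logic visible.

```java
// Illustrative sketch only: a string set on a single array with linear probing.
// All names here (ResizingStringTable, indexFor, insert, resize) are hypothetical.
class ResizingStringTable {
    private static final double LOAD_FACTOR = 0.75; // assumed, mirrors HashMap's default
    private String[] slots = new String[16];        // assumed initial capacity
    private int size = 0;                           // number of stored strings

    // Clear the sign bit before %, so the index is never negative.
    private static int indexFor(String word, int capacity) {
        return (word.hashCode() & 0x7fffffff) % capacity;
    }

    void add(String word) {
        // Grow first if one more element would exceed capacity * load factor.
        if (size + 1 > slots.length * LOAD_FACTOR) {
            resize(slots.length * 2);
        }
        if (insert(slots, word)) {
            size++;
        }
    }

    boolean contains(String word) {
        int i = indexFor(word, slots.length);
        while (slots[i] != null) {
            if (slots[i].equals(word)) return true;
            i = (i + 1) % slots.length; // probe the next slot, wrapping around
        }
        return false;
    }

    int size() { return size; }
    int capacity() { return slots.length; }

    // Returns true if the word was not already present in the given array.
    private static boolean insert(String[] table, String word) {
        int i = indexFor(word, table.length);
        while (table[i] != null) {
            if (table[i].equals(word)) return false;
            i = (i + 1) % table.length;
        }
        table[i] = word;
        return true;
    }

    // Rehash: every index depends on the capacity, so each string is re-inserted.
    private void resize(int newCapacity) {
        String[] bigger = new String[newCapacity];
        for (String word : slots) {
            if (word != null) insert(bigger, word);
        }
        slots = bigger;
    }
}
```

Because the table is never allowed to become more than three-quarters full, the probing loops always reach an empty slot and terminate. Doubling the capacity, rather than adding a fixed 100, keeps the number of full rehashes small as the collection grows.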

So if I try to find a certain value, its hash key will be different from what it was when the array was smaller.

Yup.

HASH FUNCTION: Yes, you should use the built-in hashCode() function. It's good enough for basic purposes.

DEALING WITH MULTIPLE OCCURRENCES: This is complicated. One simple solution is to have the hash entry for a given string also keep count of how many occurrences of that string are present. That is, instead of keeping multiple copies of the same string in your hash table, keep an int alongside each String that counts its occurrences.
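A rough sketch of that idea, using two parallel arrays so that no extra entry class is needed. The names (CountingStringTable, find) are hypothetical, and the sketch assumes a fixed capacity that is larger than the number of distinct words, so resizing is deliberately left out.

```java
// Illustrative sketch only: one slot per distinct word plus a parallel counter.
// Assumes capacity > number of distinct words; growth and rehashing are omitted.
class CountingStringTable {
    private final String[] words;
    private final int[] counts; // counts[i] is the number of times words[i] was added

    CountingStringTable(int capacity) {
        words = new String[capacity];
        counts = new int[capacity];
    }

    // Returns the slot holding the word, or the empty slot where it would go.
    private int find(String word) {
        int i = (word.hashCode() & 0x7fffffff) % words.length;
        while (words[i] != null && !words[i].equals(word)) {
            i = (i + 1) % words.length; // linear probing with wrap-around
        }
        return i;
    }

    void add(String word) {
        int i = find(word);
        if (words[i] == null) {
            words[i] = word; // first occurrence claims the slot
        }
        counts[i]++;         // every occurrence bumps the counter
    }

    int count(String word) {
        return counts[find(word)]; // an empty slot has a count of 0
    }
}
```

With this layout, count no longer walks over neighbouring slots looking for copies, so it cannot run past the end of the array or confuse a different word that happens to sit in the next slot.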

So how do I know when the array is full and I need to resize it?

You keep track of the size, as HashMap does. When size used > capacity * load factor, you grow the underlying array, either as a whole or in part.

int position = "orange".hashCode() % currentArraySize;

Some things to consider:

  • The result of % on a negative value is zero or negative, so a negative hashCode() can produce a negative index.
  • Math.abs can return a negative value: Math.abs(Integer.MIN_VALUE) is still Integer.MIN_VALUE.
  • Using & with a bit mask is faster; however, it requires a size that is a power of 2.
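The three points above can be seen side by side in a small sketch. The class and method names are made up for illustration; only the operators and Math.abs are standard Java.

```java
// Illustrative sketch only: three ways of turning a hash code into an array index.
class HashIndex {
    // Plain %: the result keeps the sign of the hash, so it can be negative.
    static int naive(int hash, int capacity) {
        return hash % capacity;
    }

    // Clearing the sign bit first gives a result in [0, capacity) for any hash.
    // Math.abs is not a safe substitute, because Math.abs(Integer.MIN_VALUE) is negative.
    static int nonNegative(int hash, int capacity) {
        return (hash & 0x7fffffff) % capacity;
    }

    // Bit mask: correct only when the capacity is a power of two (16, 32, 64, ...).
    static int masked(int hash, int capacityPowerOfTwo) {
        return hash & (capacityPowerOfTwo - 1);
    }
}
```

For example, HashIndex.naive(-7, 5) evaluates to -2, which would throw ArrayIndexOutOfBoundsException if used as an index, while HashIndex.masked(-7, 16) evaluates to 9.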

I was also wondering whether the built-in hashCode() method in the String class is efficient and suitable for what I am trying to implement, or whether it is better to create my own.

The built-in hashCode is cached, so it is fast. However, it is not a great hash: it has poor randomness in the lower bits, and in the higher bits for short strings. You might want to implement your own hashing strategy, possibly a 64-bit one.
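As one possible shape for such a strategy, the sketch below uses a 64-bit FNV-1a style hash. This is a technique swapped in purely for illustration, not the one this answer has in mind, and it mixes whole 16-bit chars rather than bytes, which is a simplification of the published algorithm. The class and method names are hypothetical.

```java
// Illustrative sketch only: a 64-bit FNV-1a style hash over the chars of a String.
// Simplification: standard FNV-1a processes bytes; this variant processes 16-bit chars.
class Fnv1a64 {
    private static final long OFFSET_BASIS = 0xcbf29ce484222325L; // FNV 64-bit offset basis
    private static final long PRIME = 0x100000001b3L;             // FNV 64-bit prime

    static long hash(String s) {
        long h = OFFSET_BASIS;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i); // mix in the next character
            h *= PRIME;       // multiply; overflow wraps, which is intended
        }
        return h;
    }

    // Map the 64-bit hash to an array index in [0, capacity).
    static int indexFor(String s, int capacity) {
        long nonNegative = hash(s) >>> 1; // unsigned shift drops the sign bit
        return (int) (nonNegative % capacity);
    }
}
```

Unlike String.hashCode(), nothing here is cached, so the hash is recomputed on every call. Whether the better bit distribution pays for that cost depends on the data and is worth measuring.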

DEALING WITH MULTIPLE OCCURRENCES:

This is usually done with a counter for each key. This way you can have, say, 32,767 duplicates (if you use short) or about 2 billion duplicates (if you use int) of the same key/element.
