Implementing efficient data structure using Arrays only

Question

As part of my programming course I was given an exercise to implement my own String collection. I was planning to use ArrayList or a similar collection, but one of the constraints is that we are not allowed to use any Java API to implement it, so only arrays are allowed. I could have implemented this with plain arrays, but efficiency is very important, as is the amount of data this code will be tested with. It was suggested that I use hash tables or ordered trees, as they are more efficient than arrays. After doing some research I decided to go with hash tables because they seemed easy to understand and implement, but once I started writing code I realised it is not as straightforward as I thought.

So here are the problems I have run into. I would like some advice on the best approach to solving them, again with efficiency in mind:

ACTUAL SIZE: If I understood it correctly hash tables are not ordered (indexed) so that means that there are going to be gaps in between items because hash function gives different indices. So how do I know when array is full and I need to resize it?
RESIZE: One of the difficulties is that I need to create a dynamic data structure using arrays. So if I have an array String[100], once it gets full I will need to resize it by some amount; I decided to increase it by 100 each time. Once I do that, I would need to change the positions of all existing values, since their hash keys will be different, as the key is calculated like this:

int position = "orange".hashCode() % currentArraySize;

So if I try to find a certain value its hash key will be different from what it was when array was smaller.

HASH FUNCTION: I was also wondering if built-in hashCode() method in String class is efficient and suitable for what I am trying to implement or is it better to create my own one.
DEALING WITH MULTIPLE OCCURRENCES: One of the requirements is to be able to add multiple words that are the same, because I need to be able to count how many times a word is stored in my collection. Since they are going to have the same hash code, I was planning to add the next occurrence at the next index, hoping that there will be a gap. I don't know if it is the best solution, but here is how I implemented it:

public int count(String word) {
    int count = 0;
    while (collection[(word.hashCode() % size) + count] != null && collection[(word.hashCode() % size) + count].equals(word))
        count++;
    return count;
}

Thank you in advance for your advice. Please ask if anything needs to be clarified.

PS The length of words is not fixed and varies greatly.

UPDATE Thank you for your advice. I know I made a few stupid mistakes there; I will try to do better. I took all your suggestions and quickly came up with the following structure. It is not elegant, but I hope it is roughly what you meant. I did have to make a few judgement calls, such as the bucket size: for now I halve the size of elements, but is there a way to calculate it, or some general value? Another uncertainty was the factor by which to increase my array: should I multiply by some number n, or is adding a fixed number also acceptable? I was also wondering about general efficiency, because I am actually creating instances of classes, but String is a class too, so I am guessing the difference in performance should not be too big?

Answer 1

ACTUAL SIZE: The built-in Java HashMap just resizes when the total number of elements exceeds the number of buckets multiplied by a number called the load factor, which is by default 0.75. It does not take into account how many buckets are actually full. You don't have to, either.
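To make the rule concrete, here is a minimal sketch of that check. The class and method names (LoadFactorCheck, shouldResize) are made up for illustration, and 0.75 is simply the default mentioned above; you are free to pick another value.

```java
class LoadFactorCheck {
    // Hypothetical constant: the same default load factor described above.
    static final double LOAD_FACTOR = 0.75;

    // True when the number of stored elements exceeds buckets * load factor.
    // Only the total element count matters, not which buckets are occupied.
    static boolean shouldResize(int storedElements, int bucketCount) {
        return storedElements > bucketCount * LOAD_FACTOR;
    }

    public static void main(String[] args) {
        System.out.println(shouldResize(75, 100)); // false
        System.out.println(shouldResize(76, 100)); // true
    }
}
```

So all you need is one int field that you increment on every successful add; you never have to scan the array for gaps.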

RESIZE: Yes, you'll have to rehash everything when the table is resized, which means recomputing each element's position from its hash and the new table size.

So if I try to find a certain value its hash key will be different from what it was when array was smaller.

Yup.
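A rough sketch of what rehashing could look like with plain arrays. Everything here is an assumption for illustration: the names (RehashSketch, indexFor, rehash, contains) are invented, collisions are handled by linear probing, and the new capacity is assumed to be larger than the number of stored words.

```java
class RehashSketch {
    // Non-negative bucket index: clearing the sign bit keeps % from going negative.
    static int indexFor(String word, int capacity) {
        return (word.hashCode() & 0x7fffffff) % capacity;
    }

    // Copies every stored word into a new array, recomputing each position
    // for the new capacity. Assumes newCapacity exceeds the number of words.
    static String[] rehash(String[] oldTable, int newCapacity) {
        String[] newTable = new String[newCapacity];
        for (String word : oldTable) {
            if (word == null) {
                continue; // empty slot, nothing to move
            }
            int i = indexFor(word, newCapacity);
            while (newTable[i] != null) {
                i = (i + 1) % newCapacity; // linear probing on collision
            }
            newTable[i] = word;
        }
        return newTable;
    }

    // Looks a word up by probing from its computed index until an empty slot.
    static boolean contains(String[] table, String word) {
        int i = indexFor(word, table.length);
        while (table[i] != null) {
            if (table[i].equals(word)) {
                return true;
            }
            i = (i + 1) % table.length;
        }
        return false;
    }
}
```

Lookups always use the current array length, so once every element has been moved, the fact that the old positions no longer apply does not matter.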

HASH FUNCTION: Yes, you should use the built-in hashCode() function. It's good enough for basic purposes.

DEALING WITH MULTIPLE OCCURRENCES: This is complicated. One simple solution would just be to have the hash entry for a given string also keep count of how many occurrences of that string are present. That is, instead of keeping multiple copies of the same string in your hash table, keep an int along with each String counting its occurrences.
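One way this could look using only arrays is sketched below. This is not the only layout: I am assuming two parallel arrays (one for the distinct words, one for their counts), linear probing, doubling on resize and a 0.75 load factor, and the names (WordCounts, findSlot, grow) are made up.

```java
class WordCounts {
    private String[] words = new String[16]; // distinct words only
    private int[] counts = new int[16];      // counts[i] belongs to words[i]
    private int distinct = 0;                // number of distinct words stored

    private static int indexFor(String word, int capacity) {
        return (word.hashCode() & 0x7fffffff) % capacity;
    }

    // Returns the slot holding word, or the empty slot where it would go.
    // Terminates because the table is never allowed to become full.
    private static int findSlot(String[] table, String word) {
        int i = indexFor(word, table.length);
        while (table[i] != null && !table[i].equals(word)) {
            i = (i + 1) % table.length;
        }
        return i;
    }

    public void add(String word) {
        int i = findSlot(words, word);
        if (words[i] != null) {
            counts[i]++; // already present: bump the counter, store nothing new
            return;
        }
        words[i] = word;
        counts[i] = 1;
        distinct++;
        if (distinct > words.length * 0.75) {
            grow();
        }
    }

    public int count(String word) {
        int i = findSlot(words, word);
        return words[i] == null ? 0 : counts[i];
    }

    // Doubles the capacity and re-inserts every word with its count.
    private void grow() {
        String[] oldWords = words;
        int[] oldCounts = counts;
        words = new String[oldWords.length * 2];
        counts = new int[oldCounts.length * 2];
        for (int j = 0; j < oldWords.length; j++) {
            if (oldWords[j] != null) {
                int i = findSlot(words, oldWords[j]);
                words[i] = oldWords[j];
                counts[i] = oldCounts[j];
            }
        }
    }
}
```

With this layout count() is a single lookup instead of a scan over neighbouring slots, and adding the same word a million times costs no extra memory.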

Answer 2

So how do I know when array is full and I need to resize it?

You keep track of the size, as HashMap does. When the size used > capacity * load factor, you grow the underlying array, either as a whole or in part.

int position = "orange".hashCode() % currentArraySize;

Some things to consider.

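The % of a negative value is negative (or zero), so it cannot be used directly as an array index.
Math.abs can return a negative value: Math.abs(Integer.MIN_VALUE) is still Integer.MIN_VALUE.
Using & with a bit mask is faster; however, you need a size which is a power of 2.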
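The three points above can be seen in a small sketch. The helper names (BucketIndex, byRemainder, byMask) are invented for illustration; clearing the sign bit with 0x7fffffff is just one common way to avoid the negative remainder.

```java
class BucketIndex {
    // Clears the sign bit before taking the remainder,
    // so the result is always in the range 0 .. capacity - 1.
    static int byRemainder(int hash, int capacity) {
        return (hash & 0x7fffffff) % capacity;
    }

    // Keeps only the low bits of the hash.
    // Only correct when capacity is a power of two (16, 32, 64, ...).
    static int byMask(int hash, int capacity) {
        return hash & (capacity - 1);
    }

    public static void main(String[] args) {
        System.out.println(-7 % 4);                      // -3
        System.out.println(Math.abs(Integer.MIN_VALUE)); // -2147483648
        System.out.println(byRemainder(-7, 4));          // 1
        System.out.println(byMask(-7, 4));               // 1
    }
}
```

Note that the mask version only looks at the low bits of the hash, so it is more sensitive to a hash function whose low bits are poorly distributed.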

I was also wondering if built-in hashCode() method in String class is efficient and suitable for what I am trying to implement or is it better to create my own one.

The built-in hashCode is cached, so it is fast. However, it is not a great hash: it has poor randomness in the lower bits, and in the higher bits for short strings. You might want to implement your own hashing strategy, possibly a 64-bit one.
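As an illustration of what a hand-written 64-bit hash could look like, here is a sketch using FNV-1a. That algorithm is my pick for the example, not something prescribed above, and the class and method names are made up. Each char is fed in as two bytes (low byte first), which amounts to FNV-1a over the UTF-16LE bytes of the string. Unlike String.hashCode(), the result is not cached, so it is recomputed on every call.

```java
class Fnv1a64 {
    // Standard 64-bit FNV offset basis and FNV prime.
    static final long OFFSET_BASIS = 0xcbf29ce484222325L;
    static final long PRIME = 0x100000001b3L;

    // FNV-1a: for each byte, XOR it into the hash, then multiply by the prime.
    static long hash(String s) {
        long h = OFFSET_BASIS;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            h ^= (c & 0xff); // low byte of the char
            h *= PRIME;
            h ^= (c >>> 8);  // high byte of the char
            h *= PRIME;
        }
        return h;
    }

    // Maps the 64-bit hash to a bucket; the unsigned shift drops the sign bit
    // so the remainder is never negative.
    static int indexFor(String s, int capacity) {
        return (int) ((hash(s) >>> 1) % capacity);
    }
}
```

Whether this actually beats the built-in hashCode() for your data is something to measure rather than assume.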

DEALING WITH MULTIPLE OCCURRENCES:

This is usually done with a counter for each key. This way you can have say 32767 duplicates (if you use short) or 2 billion (if you use int) duplicates of the same key/element.

2 answers

solution1
1 2015-12-09 22:31:29

solution2
1 2015-12-09 22:35:05
