How can I optimize this code?

Question

My current project has us using TreeSet and TreeMap in Java, with an input array of 10514 Song elements read in from a text file. Each Song contains a Artist, Title and Lyric fields. The aim of this project is to conduct fast searches on the lyrics using sets and maps.

First, I iterate over the input Song array, accessing the lyrics field and creating a Scanner object to iterate over the lyric words using this code: commonWords is a TreeSet of words that should not be keys, and lyricWords is the overall map of words to Songs.

public void buildSongMap() {
    for (Song song:songs) {
        //method variables
        String currentLyrics= song.getLyrics().toLowerCase(); 
        TreeSet<Song> addToSet=null;
        Scanner readIn= new Scanner(currentLyrics);
        String word= readIn.next();

        while (readIn.hasNext()) {

            if (!commonWords.contains(word) && !word.equals("") && word.length()>1) {
                if (lyricWords.containsKey(word)) {
                    addToSet= lyricWords.get(word);
                    addToSet.add(song);
                    word=readIn.next();
                } else 
                    buildSongSet(word);

            } else 
                word= readIn.next();
        }

    }

In order to build the songSet, I use this code:

public void buildSongSet(String word) {     
    TreeSet<Song> songSet= new TreeSet<Song>();
    for (Song song:songs) {
        //adds song to set 
        if (song.getLyrics().contains(word)) {
            songSet.add(song);
        }
    }
    lyricWords.put(word, songSet);
    System.out.println("Word added "+word);
}

Now, since buildSongSet is called from inside a loop, creating the map executes in N^2 time. When the input array is 4 songs, searches run very fast, but when using the full array of 10514 elements, it can take over 15+ min to build the map on a 2.4GHz machine with 6 GiB RAM. What can I do to make this code more efficient? Unfortunately, reducing the input data is not an option.

Answer 1

It looks like your buildSongSet is doing redundant work. Your block:

if (lyricWords.containsKey(word)) {
    addToSet= lyricWords.get(word);
    addToSet.add(song);
    word=readIn.next();
}

adds a song to an existing set. So, when you find a word you don't know about, just add one song to it. Change buildSongSet to:

public void buildSongSet(String word, Song firstSongWithWord) {     
    TreeSet<Song> songSet= new TreeSet<Song>();
    songSet.add(firstSongWithWord);
    lyricWords.put(word, songSet);
    System.out.println("Word added "+word);
}

the remaining songs left to be iterated will then be added to that songset from the first block of code if they contain that word. I think that should work.

EDIT just saw this was homework... so removed the HashSet recommendations..

Ok.. so suppose you have these Songs in order with lyrics:

Song 1 - foo
Song 2 - foo bar
Song 3 - foo bar baz

Song 1 will see that foo does not contain lyricWords, so it will call buildSongSet and create a set for foo. It will add itself into the set containing foo.

Song 2 will see that foo is in lyricWords, and add itself to the set. It will see bar is not in the set, and create a set and add itself. It doesn't need to traverse previous songs since the first time the word was seen was in Song 2.

Song 3 follows the same logic.

Another thing you can try doing to optimize your code is to figure out a way to not process duplicate words in the lyrics. if your lyrics are foo foo foo foo bar bar bar foo bar then you're going to be doing a lot of unnecessary checks.

EDIT also see rsp's answer - additional speedups there, but the big speedup is getting rid of the inner loop - glad it's down to 15 secs now.

Answer 2

The whole buildSongSet() method is not needed imho, as your main loop already adds songs to the collection by word. The only thing you are missing is the addition of a set for a new word, something like:

if (lyricWords.containsKey(word)) {
    addToSet= lyricWords.get(word);
} else {
    addToSet = new TreeSet();
    lyricWords.put(word, addToSet);
}
addToSet.add(song);

One issue that you did not tackle is that songs end up being added to the set multiple times, for every occurence of the word in the song.

Another issue is that in the case that a song contains just 1 word, you do not add it at all! It is always better to check the condition first:

String word = null;
while (readIn.hasNext()) {
    word = readIn.next();

Your condition is doing one check too many (the empty string has length < 1), and swapping the checks can speed up things too:

if (word.length() > 1 && !commonWords.contains(word)) {

Answer 3

Please, try change TreeSet to HashSet. I can't see where you obtain the benefits of TreeSet.

Answer 4

if you want a very extensible, easy way of solving this with performance in the order of a few millisecons. Consider lucene http://lucene.apache.org/

refer to my answer here for example of how to index and search How do I index and search text files in Lucene 3.0.2?

How can I optimize this code?

Question

4 answers

solution1
6 ACCPTED 2010-11-03 16:27:57

solution2
4 2010-11-03 17:11:20

solution3
3 2010-11-03 16:30:53

solution4
0 2010-11-03 22:29:33

How can I optimize this code?

Question

4 answers

solution1 6 ACCPTED 2010-11-03 16:27:57

solution2 4 2010-11-03 17:11:20

solution3 3 2010-11-03 16:30:53

solution4 0 2010-11-03 22:29:33

solution1
6 ACCPTED 2010-11-03 16:27:57

solution2
4 2010-11-03 17:11:20

solution3
3 2010-11-03 16:30:53

solution4
0 2010-11-03 22:29:33