
Most efficient data structure for searching for a word in text Java

I have a program which reads in a document and searches each page for a given search word. It then returns which pages the word appears in.

i.e. the word "brilliant" appears on the following pages: 1, 4, 6, 8

At the moment I split the file into pages and store them in an ArrayList. Each element of the ArrayList contains one page of the document.

I then split each page into words and store them in a HashMap, with the key being the position in the text at which the word appears (I need to know this for other functionality) and the value being the word. I then search through the HashMap using:

// containsValue already returns a boolean, so it can be returned directly
return map.containsValue(searchString);

I do this for each page.

Everything is working, but I was wondering if there is a more efficient data structure I can use that stores all the words on a given page along with the position at which each one appears (since searching through the values of a map without giving a key is O(n)).

I need to be able to search through this structure and find a word. Remember I also need the position for later use.

The code that I use to populate the map with the positions of the words in the text is:

    // text is one page of the document as a String
    int key = 1; // position of the word in the text
    for (String element : text.split(" ")) {
        map.put(key, element);
        key++;
    }

Why not just use a single HashMap<String, ArrayList<Position>> that maps each word to its occurrences? Each word of the text would be a key in the map, and the page number and position of each occurrence would form the entries of its list.

Insertion is slightly tricky because of the list value:

ArrayList<Position> positions = words.get(word);
if (positions == null) {
  positions = new ArrayList<Position>();
  words.put(word, positions);
}
positions.add(position);

Alternatively, you could use a Guava Multimap: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Multimap.html (in particular if you are already using Guava for other purposes; I'd probably avoid pulling in a library dependency just for this).

Edit: Changed Integer to Position (and the set to a list); I had overlooked that the exact position is required. Position should be similar to:

class Position {
  int page;
  int index; 
}
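
To show how the pieces of this answer fit together, here is a minimal sketch of such an index. Only the map shape and the two Position fields come from the answer above; the class name WordIndex and the methods addPage, positionsOf and pagesOf are made up for illustration, and splitting on whitespace is an assumption about how words are delimited. It uses Map.computeIfAbsent (Java 8+) in place of the explicit null check.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper class; names are illustrative, not from the original answer.
class WordIndex {
    static class Position {
        final int page;   // 1-based page number
        final int index;  // 1-based position of the word on that page

        Position(int page, int index) {
            this.page = page;
            this.index = index;
        }
    }

    // word -> every (page, index) at which it occurs
    private final Map<String, List<Position>> words = new HashMap<>();

    // Index one page, numbering its words from 1 as in the question.
    void addPage(int page, String text) {
        int index = 1;
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // an empty page yields a single empty token
            }
            // Create the list on first sight of the word, then append.
            words.computeIfAbsent(word, k -> new ArrayList<>())
                 .add(new Position(page, index));
            index++;
        }
    }

    // All occurrences of a word, or an empty list if it never appears.
    List<Position> positionsOf(String word) {
        return words.getOrDefault(word, Collections.<Position>emptyList());
    }

    // Distinct pages containing the word, in ascending order.
    TreeSet<Integer> pagesOf(String word) {
        TreeSet<Integer> pages = new TreeSet<>();
        for (Position p : positionsOf(word)) {
            pages.add(p.page);
        }
        return pages;
    }

    public static void main(String[] args) {
        WordIndex idx = new WordIndex();
        idx.addPage(1, "a brilliant idea");
        idx.addPage(2, "nothing here");
        idx.addPage(4, "brilliant brilliant work");
        System.out.println(idx.pagesOf("brilliant")); // prints [1, 4]
    }
}
```

The lookup itself is a single hash lookup; collecting the distinct pages still walks the list of occurrences for that one word, which is usually far shorter than the whole page.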

I would probably use Lucene or something from Guava collections myself, but barring that I think the most efficient structure would be:

HashMap<String, TreeMap<Integer, TreeSet<Integer>>> words;

        ^^^^^^          ^^^^^^^          ^^^^^^^
         word            page            position

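Using words.get("brilliant").keySet(); would immediately give you all the pages that "brilliant" appears on (after a null check, since get returns null for a word that never occurs). The HashMap lookup takes expected constant time and keySet() returns a view without copying, so this avoids the O(n) scan that containsValue performs on every page.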
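
A rough sketch of how this nested structure could be populated and queried follows. The class name NestedWordIndex and its methods are hypothetical; only the HashMap / TreeMap / TreeSet nesting is taken from the answer. Map.computeIfAbsent (Java 8+) creates the inner maps and sets on demand.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical wrapper around the nested map; names are illustrative.
class NestedWordIndex {
    // word -> page -> positions of the word on that page
    private final Map<String, TreeMap<Integer, TreeSet<Integer>>> words = new HashMap<>();

    // Record one occurrence, creating the inner containers when needed.
    void add(String word, int page, int position) {
        words.computeIfAbsent(word, w -> new TreeMap<>())
             .computeIfAbsent(page, p -> new TreeSet<>())
             .add(position);
    }

    // Pages containing the word, in ascending order; empty if the word is unknown.
    Set<Integer> pagesOf(String word) {
        TreeMap<Integer, TreeSet<Integer>> byPage = words.get(word);
        return byPage == null ? Collections.<Integer>emptySet() : byPage.keySet();
    }

    // Positions of the word on one page, in ascending order; empty if there are none.
    Set<Integer> positionsOn(String word, int page) {
        TreeMap<Integer, TreeSet<Integer>> byPage = words.get(word);
        if (byPage == null) {
            return Collections.<Integer>emptySet();
        }
        TreeSet<Integer> positions = byPage.get(page);
        return positions == null ? Collections.<Integer>emptySet() : positions;
    }

    public static void main(String[] args) {
        NestedWordIndex idx = new NestedWordIndex();
        idx.add("brilliant", 8, 3);
        idx.add("brilliant", 1, 12);
        idx.add("brilliant", 4, 7);
        idx.add("brilliant", 6, 2);
        idx.add("brilliant", 1, 5);
        System.out.println(idx.pagesOf("brilliant"));        // prints [1, 4, 6, 8]
        System.out.println(idx.positionsOn("brilliant", 1)); // prints [5, 12]
    }
}
```

The tree-based inner containers keep pages and positions sorted regardless of insertion order, at the cost of O(log n) inserts into each of them.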

After reading in the comments that you will also need to retrieve the words before and after each search word, I think you'll need a second data structure for that lookup:

TreeMap<Integer, TreeMap<Integer, String>> positions;

        ^^^^^^^          ^^^^^^^  ^^^^^^
         page            position  word

Or alternatively, using the respective indexes of the two lists for page and position:

ArrayList<ArrayList<String>> positions;          
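
As a sketch of the list-of-lists variant, the neighbours of a hit can be read straight from the indexes. The class name PageWords and the methods wordAt, before and after are invented for this example; it assumes pages and positions are numbered from 1 (as in the question) and that words are separated by whitespace, so the 1-based numbers are shifted by one to get list indexes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; pages.get(p - 1).get(i - 1) is word i on page p (both 1-based).
class PageWords {
    private final List<List<String>> pages = new ArrayList<>();

    // Append the next page; its page number is its 1-based place in the list.
    void addPage(String text) {
        pages.add(new ArrayList<>(Arrays.asList(text.trim().split("\\s+"))));
    }

    // Word at a 1-based page and position, or null if either is out of range.
    String wordAt(int page, int position) {
        if (page < 1 || page > pages.size()) {
            return null;
        }
        List<String> words = pages.get(page - 1);
        if (position < 1 || position > words.size()) {
            return null;
        }
        return words.get(position - 1);
    }

    // Neighbouring words; null at the start or end of a page.
    String before(int page, int position) {
        return wordAt(page, position - 1);
    }

    String after(int page, int position) {
        return wordAt(page, position + 1);
    }

    public static void main(String[] args) {
        PageWords pw = new PageWords();
        pw.addPage("a brilliant idea");
        // "brilliant" is word 2 on page 1
        System.out.println(pw.before(1, 2) + " | " + pw.after(1, 2)); // prints a | idea
    }
}
```

Index access on an ArrayList is constant time, so this is cheaper than the tree-based variant when every page and position is filled in anyway.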

