
Building an inverted index in Java-logic

I have a collection of around 1500 documents. I parsed each document and extracted its tokens. The tokens are stored as keys in a HashMap, and the total number of times each token occurs in the collection (i.e. its frequency) is stored as the value.

I have to extend this to build an inverted index, that is: term (key) | number of documents it occurs in --> DocNo | frequency in that document. For example:

    Term       DocFreq    DocNum    TermFreq
    data       3          1         12
                          23        31
                          100       17
    customer   2          22        43
                          19        2

Currently, I have the following in Java,

HashMap<String, Integer> frequencies  // term -> total occurrences in the collection
for(each document)
{
    extract lines
    for(each line)
    {
        extract words
        for(each word)
        {
            perform some operations
            get value for word from frequencies and increment by one
        }
    }
}

I have to build on this code, but I can't think of a good way to implement an inverted index. So far, I have thought of making the value a 2D array: the term would be the key, and the value (i.e. the 2D array) would store the docId and termFreq pairs.

Please let me know if my logic is correct.

I would do it by using a Map<String, TermFrequencies>. This map would hold one TermFrequencies object for each term found. The TermFrequencies object would have the following methods:

void addOccurrence(String documentId);
int getTotalNumberOfOccurrences();
Set<String> getDocumentIds();
int getNumberOfOccurrencesInDocument(String documentId);

It would use a Map<String, Integer> internally to associate each document the term occurs in with the number of occurrences of the term in the document.

The algorithm would be extremely simple:

for(each document) {
    extract line
    for(each line) {
        extract word
        for(each word) {
            TermFrequencies termFrequencies = map.get(word);
            if (termFrequencies == null) {
                termFrequencies = new TermFrequencies(word);
                map.put(word, termFrequencies);  // register the new entry, otherwise it is lost
            }
            termFrequencies.addOccurrence(documentId);  // id of the current document
        }
    }
}

The addOccurrence() method would simply increment a counter for the total number of occurrences, and would insert or update the number of occurrences for that document in the internal map.
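A minimal sketch of the TermFrequencies idea described above. The four method names come from the answer; everything else (the getTerm() accessor, the TermIndexDemo class, the buildIndex helper, and passing documents in as already-tokenized word arrays) is an assumption made only to keep the example self-contained.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Statistics for one term: total occurrences plus occurrences per document. */
class TermFrequencies {
    private final String term;
    private int total = 0;
    // documentId -> number of occurrences of the term in that document
    private final Map<String, Integer> perDocument = new HashMap<>();

    TermFrequencies(String term) {
        this.term = term;
    }

    String getTerm() {  // hypothetical accessor, not part of the answer
        return term;
    }

    void addOccurrence(String documentId) {
        total++;
        perDocument.merge(documentId, 1, Integer::sum);  // insert 1 or add 1
    }

    int getTotalNumberOfOccurrences() {
        return total;
    }

    Set<String> getDocumentIds() {
        return Collections.unmodifiableSet(perDocument.keySet());
    }

    int getNumberOfOccurrencesInDocument(String documentId) {
        return perDocument.getOrDefault(documentId, 0);
    }
}

class TermIndexDemo {
    /** Builds the index from tokenized documents (documentId -> words). */
    static Map<String, TermFrequencies> buildIndex(Map<String, String[]> documents) {
        Map<String, TermFrequencies> index = new HashMap<>();
        for (Map.Entry<String, String[]> doc : documents.entrySet()) {
            for (String word : doc.getValue()) {
                // computeIfAbsent creates AND stores the entry on first sight
                index.computeIfAbsent(word, TermFrequencies::new)
                     .addOccurrence(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String[]> docs = new HashMap<>();
        docs.put("1", new String[] {"data", "customer", "data"});
        docs.put("2", new String[] {"data"});

        TermFrequencies data = buildIndex(docs).get("data");
        System.out.println("docFreq   = " + data.getDocumentIds().size());
        System.out.println("tf(doc 1) = " + data.getNumberOfOccurrencesInDocument("1"));
    }
}
```

The document frequency from the question's table is simply getDocumentIds().size(), so it does not need to be stored separately.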

I think it is best to have two structures: a Map<docnum, Map<term, termFreq>> and a Map<term, Set<docnum>>. Your docFreqs can be read off as set.size() on the values of the second map. This solution involves no custom classes and allows quick retrieval of everything needed.

The first map contains all the information, and the second one is derived from it to allow quick lookup by term. As you process a document, you fill the first map. You can derive the second map afterwards, but it is also easy to fill both in one pass.
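The two-map layout can be sketched as follows, filling both maps in one pass. The wrapper class TwoMapIndex and its method names are assumptions for illustration (the answer itself needs no custom class; the wrapper only groups the two maps), and docnum is assumed to be an Integer.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TwoMapIndex {
    // docNum -> (term -> termFreq): holds all the information
    final Map<Integer, Map<String, Integer>> termFreqsByDoc = new HashMap<>();
    // term -> docNums containing it: derived structure for lookup by term
    final Map<String, Set<Integer>> docsByTerm = new HashMap<>();

    /** Records one occurrence of term in document docNum, updating both maps. */
    void addWord(int docNum, String term) {
        termFreqsByDoc.computeIfAbsent(docNum, k -> new HashMap<>())
                      .merge(term, 1, Integer::sum);
        docsByTerm.computeIfAbsent(term, k -> new TreeSet<>()).add(docNum);
    }

    /** Number of documents the term occurs in. */
    int docFreq(String term) {
        Set<Integer> docs = docsByTerm.get(term);
        return docs == null ? 0 : docs.size();
    }

    /** Number of occurrences of the term in one document. */
    int termFreq(String term, int docNum) {
        Map<String, Integer> freqs = termFreqsByDoc.get(docNum);
        return freqs == null ? 0 : freqs.getOrDefault(term, 0);
    }

    public static void main(String[] args) {
        TwoMapIndex index = new TwoMapIndex();
        index.addWord(1, "data");
        index.addWord(1, "data");
        index.addWord(23, "data");
        index.addWord(22, "customer");

        // Print one row per (term, document) pair, as in the question's table
        for (Map.Entry<String, Set<Integer>> e : index.docsByTerm.entrySet()) {
            for (int docNum : e.getValue()) {
                System.out.println(e.getKey() + "\t" + index.docFreq(e.getKey())
                        + "\t" + docNum + "\t" + index.termFreq(e.getKey(), docNum));
            }
        }
    }
}
```

A TreeSet is used here only so that document numbers come back sorted; a HashSet would work equally well if order does not matter.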

I once implemented what you're asking for. The problem with your approach is that it is not abstract enough. You should model terms, documents and their relationships using objects. In a first pass, create the term index and document objects, and iterate over all terms in the documents while populating the term index. Afterwards, you have an in-memory representation that you can easily transform into the desired output. Do not start by thinking about 2D arrays in an object-oriented language. Unless you want to solve a mathematical problem or optimize something, it's not the right approach most of the time.

I don't know if this is still a hot question, but I would recommend doing it like this:

Iterate over all your documents and give each one an ID in increasing order. For each document, iterate over all its words.

You keep a HashMap that maps Strings (your words) to an array of DocTermObjects. A DocTermObject contains a docId and a TermFrequency.

Now, for each word in a document, look it up in your HashMap. If there is no array of DocTermObjects for it yet, create one. Otherwise, look only at its LAST element (this matters for the running time: because documents are processed one at a time in increasing docId order, an entry for the current document can only be the last one, so the array never has to be scanned). If that element has the docId of the document you are currently processing, increase its TermFrequency. Otherwise, or if the array is empty, append a new DocTermObject with the current docId and set its TermFrequency to 1.

Later you can use this data structure to compute scores, for example. You could of course also store the scores in the DocTermObjects.
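The last-element check described above could look roughly like this. DocTermObject follows the answer's naming; the PostingListIndex class, its method names, and the use of an ArrayList instead of a raw array are assumptions. The sketch relies on addWord being called with non-decreasing docIds, i.e. one document at a time in ID order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** One posting: a document and how often the term occurs in it. */
class DocTermObject {
    final int docId;
    int termFrequency = 1;  // created on the first occurrence

    DocTermObject(int docId) {
        this.docId = docId;
    }
}

class PostingListIndex {
    private final Map<String, List<DocTermObject>> postings = new HashMap<>();

    /** Records one occurrence; docIds must arrive in non-decreasing order. */
    void addWord(int docId, String word) {
        List<DocTermObject> list =
                postings.computeIfAbsent(word, k -> new ArrayList<>());
        // Only the last element can belong to the current document,
        // so this check is O(1) instead of a scan over the whole list.
        if (!list.isEmpty() && list.get(list.size() - 1).docId == docId) {
            list.get(list.size() - 1).termFrequency++;
        } else {
            list.add(new DocTermObject(docId));
        }
    }

    /** Postings for a word, or an empty list if the word was never seen. */
    List<DocTermObject> postingsFor(String word) {
        List<DocTermObject> list = postings.get(word);
        if (list == null) {
            return new ArrayList<>();
        }
        return list;
    }

    public static void main(String[] args) {
        PostingListIndex index = new PostingListIndex();
        String[][] docs = { {"data", "customer", "data"}, {"data"} };
        for (int docId = 0; docId < docs.length; docId++) {
            for (String word : docs[docId]) {
                index.addWord(docId, word);
            }
        }
        List<DocTermObject> data = index.postingsFor("data");
        System.out.println("docFreq(data) = " + data.size());
        for (DocTermObject p : data) {
            System.out.println("doc " + p.docId + ": tf = " + p.termFrequency);
        }
    }
}
```

The length of a word's posting list is its document frequency, and because docIds only ever increase, each list stays sorted by docId without any extra work.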

Hope it helped :)

