
Filtering term count in Lucene (Java)

I'm currently trying to get the number of occurrences of each word in a description field using Lucene. For example:

  • description: BOX OF APPLES
  • description: BOX OF BANANAS

output:

  • BOX 2
  • OF 2
  • APPLES 1
  • BANANAS 1

I am looking to get the word and the frequency.

The catch is that I would like to restrict those results to a given document, that is, count only the words in the description field of that document.

Thanks for any assistance given.

// In answer to a comment: I have something like this:

public ArrayList<ObjectA> GetIndexTerms(String code) {
    try {
        ArrayList<ObjectA> termlist = new ArrayList<ObjectA>();
        indexR = IndexReader.open(path);
        // Enumerates every term in the whole index, not the terms of one document
        TermEnum terms = indexR.terms();

        while (terms.next()) {
            Term term = terms.term();
            String termText = term.text();
            // docFreq is the number of documents in the index that contain the term
            int frequency = indexR.docFreq(term);
            ObjectA newObj = new ObjectA(termText, frequency);
            termlist.add(newObj);
        }
        return termlist;
    } catch (Exception ex) {
        ex.printStackTrace();
        return null;
    }
}

But I don't see how to filter it by document...


// Update (today):

Using the TermFreqVector I can get it to work, but it takes the doc id and I can't get that right. Because I iterate over the results of a query, the "i" value starts at 0, and that is not the proper doc id. Any ideas on how to get this working properly? Thanks!

    TopDocs tp = indexS.search(query, Integer.MAX_VALUE);
    for (int i = 0; i < tp.scoreDocs.length; i++) {
        ScoreDoc sds = tp.scoreDocs[i];
        Document doc = indexS.doc(sds.doc);
        TermFreqVector tfv = indexR.getTermFreqVector(i, "description");

        for (int j = 0; j < tfv.getTerms().length; j++) {
            String item = tfv.getTerms()[j];
            termlist.add(new TerminoDescripcion(item.toUpperCase(), tfv.getTermFrequencies()[j]));
        }
    }

The problem is that Lucene is an inverted index, meaning that it makes it easy to retrieve documents based on terms, whereas you are looking for the opposite, i.e. retrieving terms based on documents.
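To make the asymmetry concrete, here is a toy sketch in plain Java collections. It is not Lucene code and does not reflect Lucene's internal layout; the class and method names (`InvertedIndexSketch`, `termsOfDocByScanning`, `termVector`) are made up for illustration, and tokenization is assumed to be a simple whitespace split.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Toy model only: plain Java collections, not Lucene classes.
class InvertedIndexSketch {
    // term -> ids of the documents containing it (the "inverted" direction)
    final Map<String, TreeSet<Integer>> postings = new LinkedHashMap<>();
    // doc id -> (term -> count in that document); plays the role of a term vector
    final Map<Integer, Map<String, Integer>> termVectors = new LinkedHashMap<>();

    void add(int docId, String description) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String term : description.trim().split("\\s+")) {
            if (term.isEmpty()) continue; // blank description
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            counts.merge(term, 1, Integer::sum);
        }
        termVectors.put(docId, counts);
    }

    // Term -> documents: a single lookup.
    int docFreq(String term) {
        TreeSet<Integer> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    // Document -> terms without a term vector: every term's postings must be scanned.
    List<String> termsOfDocByScanning(int docId) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, TreeSet<Integer>> e : postings.entrySet()) {
            if (e.getValue().contains(docId)) result.add(e.getKey());
        }
        return result;
    }

    // Document -> terms with a term vector: a single lookup again.
    Map<String, Integer> termVector(int docId) {
        return termVectors.getOrDefault(docId, Collections.<String, Integer>emptyMap());
    }
}
```

With the two example descriptions added as documents 0 and 1, `docFreq("BOX")` is 2 (the index-wide count from the question), while `termVector(1)` contains only BOX, OF and BANANAS, each with a count of 1.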

Fortunately, this is a common problem, and Lucene lets you retrieve the terms of a document (term vectors), provided that you enabled this feature at indexing time.

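See Field.TermVector.YES and the Field constructor for how to enable term vectors at indexing time, and IndexReader.getTermFreqVector for how to retrieve them at search time. Note that getTermFreqVector expects the document's id in the index (ScoreDoc.doc for a search hit), not the position of the hit in the result list.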

Alternatively, you could re-analyze a stored field on the fly, but this may be slower, especially on large fields.
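As a rough illustration of the re-analysis route, the sketch below counts terms in the text of a stored field once you have loaded it as a string. It is plain Java and only an approximation: the tokenizer here (split on anything that is not a letter or digit, then uppercase) is an assumption standing in for whatever Analyzer was used at indexing time, and `StoredFieldTermCounter` and `countTerms` are hypothetical names.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class StoredFieldTermCounter {
    // Counts how often each token occurs in the stored text of one document.
    // The regex split is a stand-in for a real analyzer, so the counts match the
    // index only if the index-time analysis was equally simple.
    static Map<String, Integer> countTerms(String storedText) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by term
        if (storedText == null) {
            return counts; // field not stored for this document
        }
        for (String token : storedText.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) continue; // leading delimiter or empty input
            counts.merge(token.toUpperCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }
}
```

For a search hit, you would pass in the stored description of that hit's document (for instance the value read from the Document loaded for it). The cost grows with the length of the field, which is why this route can be slower than reading a precomputed term vector.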
