I'm upgrading from Lucene 3.6 to Lucene 4.0-beta. In Lucene 3.x, the IndexReader contains a method IndexReader.getTermFreqVectors(), which I can use to extract the frequency of each term in a given document and field.
This method is now replaced by IndexReader.getTermVectors(), which returns Terms. How can I make use of this (or probably other methods) to extract the term frequency in a document and a field?
Perhaps this will help you:
// get the term vector for one document and one field
Terms terms = reader.getTermVector(docID, "fieldName");
if (terms != null && terms.size() > 0) {
    // access the terms for this field
    TermsEnum termsEnum = terms.iterator(null);
    BytesRef term = null;
    // explore the terms for this field
    while ((term = termsEnum.next()) != null) {
        // enumerate through documents, in this case only one
        DocsEnum docsEnum = termsEnum.docs(null, null);
        int docIdEnum;
        while ((docIdEnum = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
            // get the term frequency in the document
            System.out.println(term.utf8ToString() + " " + docIdEnum + " " + docsEnum.freq());
        }
    }
}
See this related question, specifically:
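Terms vector = reader.getTermVector(docId, CONTENT);
TermsEnum termsEnum = null;
termsEnum = vector.iterator(termsEnum);
Map<String, Integer> frequencies = new HashMap<>();
BytesRef text = null;
while ((text = termsEnum.next()) != null) {
    String term = text.utf8ToString();
    // within a term vector, totalTermFreq() is the term's frequency in this one document
    int freq = (int) termsEnum.totalTermFreq();
    frequencies.put(term, freq);
    // `terms` is not declared in this excerpt; presumably a collection defined in the linked answer
    terms.add(term);
}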
Various documentation is available on how to use the flexible indexing APIs.
Accessing the Fields/Terms for a document's term vectors uses exactly the same API as accessing the postings lists, since term vectors are really just a miniature inverted index for that one document.
So it's perfectly OK to use all those examples as-is, though you can take some shortcuts since you know there is only ever one document in this "miniature inverted index". For example, if you just want the frequency of a term, you can seek to it and use the aggregate statistics such as totalTermFreq (see https://builds.apache.org/job/Lucene-Artifacts-4.x/javadoc/core/org/apache/lucene/index/package-summary.html#stats), rather than actually opening a DocsEnum that will only enumerate over a single document.
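To make the "miniature inverted index" idea concrete, here is a rough plain-Java model of a single-document term vector. It is only a sketch: the class and method names (TermVectorSketch, buildTermVector, freqByEnumeration, freqBySeek) are made up for illustration and are not Lucene API. The enumeration method mirrors walking a TermsEnum with next(), and the seek method mirrors jumping straight to one term and reading its aggregate count. In Lucene itself the equivalent would be a seek on the TermsEnum followed by totalTermFreq(); the exact seek signature differs between 4.x releases, so check the javadoc for your version.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of a one-document term vector; none of these names are Lucene API.
public class TermVectorSketch {

    // Build a sorted term -> frequency map from one document's field text.
    public static TreeMap<String, Integer> buildTermVector(String fieldText) {
        TreeMap<String, Integer> vector = new TreeMap<>();
        for (String token : fieldText.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                vector.merge(token, 1, Integer::sum);
            }
        }
        return vector;
    }

    // Enumeration path: walk the terms in sorted order until the target is found.
    public static int freqByEnumeration(TreeMap<String, Integer> vector, String term) {
        for (Map.Entry<String, Integer> entry : vector.entrySet()) {
            if (entry.getKey().equals(term)) {
                return entry.getValue();
            }
        }
        return 0; // term does not occur in this document
    }

    // Seek path: look the term up directly and read its count.
    public static int freqBySeek(TreeMap<String, Integer> vector, String term) {
        return vector.getOrDefault(term, 0);
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> vector =
                buildTermVector("the quick fox jumps over the lazy dog");
        System.out.println(freqByEnumeration(vector, "the")); // prints 2
        System.out.println(freqBySeek(vector, "the"));        // prints 2
        System.out.println(freqBySeek(vector, "cat"));        // prints 0
    }
}
```

Both paths return the same number because the vector covers exactly one document, which is why the aggregate statistic is enough and no per-document enumeration is needed.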
I have this working on my Lucene 4.2 index. This is a small test program that works for me.
try {
    directory[0] = new SimpleFSDirectory(new File(test1));
    directory[1] = new SimpleFSDirectory(new File(test2));
    directory[2] = new SimpleFSDirectory(new File(test3));
    directoryReader[0] = DirectoryReader.open(directory[0]);
    directoryReader[1] = DirectoryReader.open(directory[1]);
    directoryReader[2] = DirectoryReader.open(directory[2]);
    // refresh the third reader if its index has changed since it was opened
    if (!directoryReader[2].isCurrent()) {
        directoryReader[2] = DirectoryReader.openIfChanged(directoryReader[2]);
    }
    // search all three indexes as one
    MultiReader mr = new MultiReader(directoryReader);
    TermStats[] stats = null;
    try {
        // in Lucene 4.2 the third argument is the name of the field to collect terms from
        stats = HighFreqTerms.getHighFreqTerms(mr, 100, "My Term");
    } catch (Exception e1) {
        e1.printStackTrace();
        return;
    }
    for (TermStats termstat : stats) {
        System.out.println("IBI_body: " + termstat.termtext.utf8ToString() +
                ", docFrequency: " + termstat.docFreq);
    }
} catch (IOException e) {
    // opening a directory or reader can fail with an IOException
    e.printStackTrace();
}