
Lucene calculate average term frequency

I am currently implementing a modification of Lucene's standard BM25 similarity, based on the following paper. Implementing the actual formula is straightforward, but I am struggling with computing the necessary statistics.

I need the following two statistics:

  • Average term frequency of a document: length of document / number of unique terms in the document (i.e., an indicator of how repetitive a document is: for a document with no repetitions this would be 1, with each term occurring twice it would be 2, and so on).
  • Mean average term frequency: the arithmetic mean of the above measure over all documents in the collection. This can be seen as the average repetitiveness of the whole corpus.
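To make the two definitions concrete, here is a minimal sketch in plain Kotlin that works on already-tokenized documents. The function names are made up for illustration, and returning 0.0 for an empty document or collection is an assumption, since the definitions above do not cover that case.

```kotlin
// Average term frequency of one document: total tokens / distinct tokens.
// Returns 0.0 for an empty document (an assumption, not part of the definition).
fun averageTermFrequency(terms: List<String>): Double {
    if (terms.isEmpty()) return 0.0
    return terms.size.toDouble() / terms.toSet().size
}

// Mean average term frequency: arithmetic mean of the per-document values.
// Returns 0.0 for an empty collection (also an assumption).
fun meanAverageTermFrequency(docs: List<List<String>>): Double {
    if (docs.isEmpty()) return 0.0
    return docs.map { averageTermFrequency(it) }.average()
}
```

For example, averageTermFrequency(listOf("a", "a", "b", "b")) evaluates to 2.0, and a collection consisting of that document plus listOf("a", "b", "c") has a mean average term frequency of 1.5.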

I found out that I can calculate the per-document average term frequency at indexing time by overriding the computeNorm method of my Similarity implementation. I can store the value alongside the norm value using bit operations (not exceptionally pretty, but so far it works). At query time I can then extract the document's average term frequency and length.
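The exact bit layout is not given here, so the following is only one possible sketch of packing both values into a single long. It assumes the norm is stored as a full 64-bit value; the layout (length in the upper 32 bits, the raw float bits of the average term frequency in the lower 32 bits) and the function names are hypothetical.

```kotlin
// Pack the document length into the upper 32 bits and the average term
// frequency (as raw IEEE 754 float bits) into the lower 32 bits.
fun packNorm(length: Int, avgTf: Float): Long =
    (length.toLong() shl 32) or (avgTf.toRawBits().toLong() and 0xFFFFFFFFL)

// Recover the document length from the upper 32 bits.
fun unpackLength(norm: Long): Int = (norm ushr 32).toInt()

// Recover the average term frequency from the lower 32 bits.
fun unpackAvgTf(norm: Long): Float = Float.fromBits(norm.toInt())
```

With this layout, unpackLength(packNorm(120, 1.5f)) gives back 120 and unpackAvgTf(packNorm(120, 1.5f)) gives back 1.5f. If the norm is compressed to fewer bits, both values would need to be quantized first.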

However, this does not help with finding the mean average term frequency. It is obviously a collection-wide value and should therefore be computed in Similarity.computeWeight, as far as I understand, but I don't see how this could be done given the arguments of that function.

Which would be the ideal place for calculating these statistics?

I am new to Lucene, so there may be an obvious solution that I have not seen yet. I am grateful for any input.

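The Similarity.computeWeight method has a CollectionStatistics parameter, which contains maxDoc (returns the total number of documents, regardless of whether they all contain a value for this field), and a TermStatistics parameter, which contains term and totalTermFreq (returns the total number of occurrences of this term). Dividing totalTermFreq by maxDoc gives an average term frequency.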

You'll need to calculate your own "norm" to stick in Lucene's index. Basically, you can store additional features to use in your scoring using NumericDocValuesField.

This means that, at index time, you'll want to tokenize your text yourself. I have some example code (in Kotlin, but I'm happy to answer follow-up questions if you prefer Java).

Tokenize based on any Lucene analyzer. (This is expressed as a Kotlin extension function; if you're more comfortable in Java, imagine a static method whose first argument is the Analyzer, referred to as "this" below.)

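import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute

// Runs the analyzer over `input` and returns the resulting tokens in order.
fun Analyzer.tokenize(field: String, input: String): List<String> {
    val tokens = arrayListOf<String>()
    this.tokenStream(field, input).use { body ->
        val charTermAttr = body.addAttribute(CharTermAttribute::class.java)

        // iterate over tokenized field:
        body.reset()
        while (body.incrementToken()) {
            tokens.add(charTermAttr.toString())
        }
        body.end() // finish end-of-stream processing before use {} closes the stream
    }
    return tokens
}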

Then you take the tokenized text, and calculate the information you need based on it. The code I'm using wants them separate, but something like this should get you going.

    fun setTextField(field: String, text: String, terms: List<String>): List<IndexableField> {
        val length = terms.size
        val uniqLength = terms.toSet().size

        val keep = ArrayList<IndexableField>()
        keep.add(TextField(field, text, Field.Store.YES))
        keep.add(NumericDocValuesField("lengths:$field", length.toLong()))
        keep.add(NumericDocValuesField("unique:$field", uniqLength.toLong()))
        return keep
    }

This is a per-document statistic, so you can keep track of the mean while indexing and store it separately from Lucene; e.g., I usually create a "meta.json" file next to the index for these kinds of things.
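Tracking the mean during indexing could look roughly like the sketch below. It is not the code I actually use: the class name, the JSON key names, and the choice to skip empty documents are all assumptions made for illustration.

```kotlin
import java.io.File

// Accumulates the mean of per-document average term frequencies during indexing.
class MeanAvgTfTracker {
    private var sum = 0.0
    private var docCount = 0L

    // Call once per indexed document with its token count and distinct-token count.
    fun addDocument(length: Int, uniqLength: Int) {
        if (uniqLength == 0) return // skip empty documents (an assumption)
        sum += length.toDouble() / uniqLength
        docCount++
    }

    // Arithmetic mean over all documents seen so far; 0.0 if none were added.
    fun mean(): Double = if (docCount == 0L) 0.0 else sum / docCount

    // Writes a tiny JSON file next to the index; the key names are hypothetical.
    fun writeTo(file: File) {
        file.writeText("""{"docCount": $docCount, "meanAvgTf": ${mean()}}""")
    }
}
```

You would call addDocument with the same length and uniqLength values that go into the NumericDocValuesFields, then call writeTo(File("meta.json")) once indexing finishes and read the value back when constructing the similarity at query time.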

I'm not familiar with Solr per se, but when you implement a Weight subclass in Lucene, you have access to these numeric doc values as follows:

class SpecialBM25(...) : Weight(...) {
    ...
    override fun scorer(context: LeafReaderContext): Scorer {
        val uniq = context.reader().getNumericDocValues("unique:$field")
        val lengths = context.reader().getNumericDocValues("lengths:$field")
        ... generate Scorer and give it your additional features ...
    }
}


 