Lucene: how to sort groups by document count in a grouping search

Question

I have documents like

{id:1, name:foo, from: China}
{id:2, name:bar, from: USA}
{id:3, name:baz, from: Japan}
{id:4, name:foo, from: China}

Then I group these documents by the from field.

I want to get the top N countries that users come from.

I don't know how to sort the groups by the number of documents in each group. Is there a better way to do this?

Answer 1

Maybe you can build a sort like this: new Sort(new SortField[]{new SortField("from", SortField.STRING), new SortField("id", SortField.INT)}).

If you have other requirements, you can implement your own Collector. Lucene collects results using a min-heap, and you can use a TreeSet that stores the docs sharing the same from value as each element in the heap; within the TreeSet you can sort by id.
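The heap-plus-TreeSet idea can be sketched in plain Java, outside Lucene. This is not a Lucene Collector, only a rough illustration of the data structures: the class GroupHeapSketch and its methods groupIds and largestGroups are hypothetical names, and the documents are assumed to be already available as parallel arrays of from values and ids.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene.
final class GroupHeapSketch {

    /** Groups ids by country; each TreeSet keeps its ids in ascending order. */
    static Map<String, TreeSet<Integer>> groupIds(String[] countries, int[] ids) {
        Map<String, TreeSet<Integer>> groups = new HashMap<>();
        for (int i = 0; i < ids.length; i++) {
            TreeSet<Integer> set = groups.get(countries[i]);
            if (set == null) {
                set = new TreeSet<>();
                groups.put(countries[i], set);
            }
            set.add(ids[i]);
        }
        return groups;
    }

    /** Returns the names of the n largest groups, largest first, via a bounded min-heap. */
    static List<String> largestGroups(Map<String, TreeSet<Integer>> groups, int n) {
        // Order heap entries by group size, smallest on top.
        Comparator<Map.Entry<String, TreeSet<Integer>>> bySize =
            (a, b) -> Integer.compare(a.getValue().size(), b.getValue().size());
        PriorityQueue<Map.Entry<String, TreeSet<Integer>>> heap = new PriorityQueue<>(bySize);
        for (Map.Entry<String, TreeSet<Integer>> e : groups.entrySet()) {
            heap.add(e);
            if (heap.size() > n) {
                heap.poll(); // evict the smallest group seen so far
            }
        }
        // Polling yields smallest first, so reverse to get largest first.
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        String[] countries = {"China", "USA", "Japan", "China"};
        int[] ids = {1, 2, 3, 4};
        Map<String, TreeSet<Integer>> groups = groupIds(countries, ids);
        for (String country : largestGroups(groups, 2)) {
            System.out.println(country + " " + groups.get(country));
        }
    }
}
```

Groups of equal size come out in no particular order here; add a tie-breaker to the comparator if the order matters.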

Answer 2

I don't think that using Lucene is the best way to achieve what you want, but you can iterate over the results and collect the count of each country in a Map:

final Map<String, Integer> country2count = new HashMap<String, Integer>();
for (final ScoreDoc hit : hits) {
    final int docId = hit.doc;
    if (!reader.isDeleted(docId)) {
        // Get the document from docId
        final Document document = searcher.doc(docId);
        // Get the country
        final String country = document.get("from");

        if (country2count.containsKey(country)) {
            int prevCount = country2count.get(country);
            country2count.put(country, ++prevCount);
        } else {
            country2count.put(country, 1);
        }
    }
}
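
The loop above only fills the map; to get the top N countries, the entries still have to be ordered by count. A minimal sketch of that step, assuming the counts are already in a map like country2count (the class TopCountries and the method topN are hypothetical names, and ties are broken alphabetically as an arbitrary choice):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class TopCountries {

    /** Returns up to n entries with the highest counts, largest first; ties ordered by key. */
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> country2count, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(country2count.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue()); // descending count
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        // Clamp n so that negative or oversized values do not throw.
        int limit = Math.min(Math.max(n, 0), entries.size());
        return new ArrayList<>(entries.subList(0, limit));
    }

    public static void main(String[] args) {
        Map<String, Integer> country2count = new HashMap<>();
        country2count.put("China", 2);
        country2count.put("USA", 1);
        country2count.put("Japan", 1);
        for (Map.Entry<String, Integer> e : topN(country2count, 2)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Sorting every entry is fine for a few hundred countries; for a very large number of distinct values a bounded heap would avoid the full sort.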

I recommend not using an index at all, but a simple log file, and then getting the countries with the highest number of users with:

cat id_name_from.log | awk '{print $3}' | sort | uniq -c | sort -nrk 1

Example: log file saved as "id \t name \t from":

1   foo China
2   foo Usa
3   bar China
4   foo China
5   foo China
6   foo Usa
7   bar China
8   foo China
9   foo Usa

script:

cat log | awk '{print $3}' | sort | uniq -c | sort -nrk 1

results:

6 China
3 Usa

Answer 3

// Do a first pass (this is the "expensive" part)
String gField = "from";
Sort gSort = Sort.RELEVANCE;
int gOffset = 0;
int gLimit = 25;
TermFirstPassGroupingCollector firstCollector = new TermFirstPassGroupingCollector(...);
indexSearcher.search(query, firstCollector);
Collection<SearchGroup<BytesRef>> topGroups = firstCollector.getTopGroups(...);

// Do a second pass
Sort sortWithinGroup = new Sort(new SortField("WhateverYouWantToSortBy"...));
int offsetInGroup = 0;
int docsPerGroup = 1;
TermSecondPassGroupingCollector secondCollector = new TermSecondPassGroupingCollector(...);
indexSearcher.search(query, secondCollector);
TopGroups<BytesRef> results = secondCollector.getTopGroups(offsetInGroup);

// Do other stuff ...

3 answers

solution1
0 2013-09-05 03:11:36

solution2
0 ACCEPTED 2013-09-05 18:16:59

solution3
0 2013-10-25 13:54:04
