How do you configure Lucene in Sitecore to only index the latest version of an item on the master db?

I recognise this is a moot point on the web database, so this question applies to the master db...

I have a custom index set up in Sitecore 6.4.1 as follows:

<index id="search_content_US" type="Sitecore.Search.Index, Sitecore.Kernel">
    <param desc="name">$(id)</param>
    <param desc="folder">_search_content_US</param>
    <Analyzer ref="search/analyzer" />
    <locations hint="list:AddCrawler">
        <search_content_home type="Sitecore.Search.Crawlers.DatabaseCrawler, Sitecore.Kernel">
            <Database>master</Database>
            <Root>/sitecore/content/usa home</Root>
            <Tags>home content</Tags>
        </search_content_home>
    </locations>
</index>

I query the index like this (I am using techphoria414's SortableIndexSearchContext from this answer: How to sort/filter using the new Sitecore.Search API):

private SearchHits GetSearchResults(SortableIndexSearchContext searchContext, string searchTerm)
{
    CombinedQuery query = new CombinedQuery();
    query.Add(new FullTextQuery(searchTerm), QueryOccurance.Must);
    return searchContext.Search(query, Sort.RELEVANCE);
}

...

SearchHits hits = GetSearchResults(searchContext, searchTerm);

hits is a collection of search hits from my index. When I iterate through hits I can see many duplicates of the same Sitecore item: one hit per version of the item.

I then do the following to get a SearchResultCollection:

SearchResultCollection results = hits.FetchResults(0, hits.Length);

This combines all of the duplicates into a single SearchResult object. That object represents one version of a particular item, and has a property called SubResults, which is a collection of SearchResult objects representing all of the other versions of the item.

Here's my problem:

The version of the item represented by the SearchResult is NOT the current published version of the item! It appears to be a randomly selected version (whichever the search method hit first in the index). The latest version is included in the SubResults collection, however.

Eg:

SearchResult
 |
 |- Version 8 // main result
 ...
 |- SubResults
      |
      |- Version 9 // latest version
      |- Version 3
      |- Version 5
      ... // all versions in random order

How do I prevent this from happening on the master db? Either by preventing Lucene from indexing old versions of items, or by manipulating the result set to get the latest version from the SubResults?

As an aside, why does Lucene bother to index old versions of items anyway? Surely this is pointless for searching content on your website, as the old versions are not visible?

You can implement a custom crawler that overrides the following:

using Sitecore.Data.Items;
using Sitecore.Search.Crawlers;

public class IndexCrawler : DatabaseCrawler
{
    protected override void IndexVersion(Item item, Item latestVersion, Sitecore.Search.IndexUpdateContext context)
    {
        // Skip every version except the latest one, so older versions never reach the index.
        if (item.Versions.Count > 0 && item.Version.Number != latestVersion.Version.Number)
            return;

        base.IndexVersion(item, latestVersion, context);
    }
}

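This ensures that only the latest version of an item gets into your index, so it is the only version that can be pulled out of that index.

You would of course need to update your configuration file so that the crawler entry under locations (search_content_home in the question's configuration) uses the type of your custom crawler instead of Sitecore.Search.Crawlers.DatabaseCrawler.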

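In Sitecore 7, the field _latestversion was added to the index. It contains "1" for the latest version (other versions have an empty value).

If you let Lucene search your web database instead of master, it should only index the last published version.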

<Database>web</Database>

Although the solution provided by theyetiman, which uses an adjusted sort mechanism, is an interesting approach, it is not a perfect solution when the Lucene scores of two versions differ. For example, given v1 with score 0.7 and v2 with score 0.5, his solution will still return the first version of the item. (At least in my tests.)

After some more digging, the most obvious solution apparently lies in implementing your own Sitecore.Pipelines.Search.SearchSystemIndex and using that one instead of the default. If you decompile that code using ILSpy or similar, you will notice the following at the bottom of the Process method:

foreach (SearchResult current in searchHits.FetchResults(0, searchHits.Length)){
  // ...
}

Each such SearchResult is actually a group: the first hit that Lucene returned for the item (thus the one with the highest score) becomes the main result. Hits on other versions (and also other languages) of the same item are accessible through the Subresults property of each instance, which is null when there are none.

Depending on your requirements, you can adjust this part of the class to fit your needs.
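As a rough illustration of that regrouping idea, here is a minimal sketch that does not depend on any Sitecore types. The Hit class and the HitRegrouper helper are hypothetical stand-ins invented for this example (they are not part of Sitecore.Search); in a real SearchSystemIndex replacement you would read the item id, language, version and score from the actual hits instead. It keeps the highest version per item and language, regardless of which version scored best:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a single Lucene hit; not a Sitecore type.
public class Hit
{
    public string ItemId;
    public string Language;
    public int Version;
    public float Score;
}

public static class HitRegrouper
{
    // For each item/language pair, keeps only the hit with the highest version
    // number, then orders the surviving hits by their own score (best first).
    public static List<Hit> LatestVersions(IEnumerable<Hit> hits)
    {
        return hits
            .GroupBy(h => h.ItemId + "%" + h.Language)
            .Select(g => g.OrderByDescending(h => h.Version).First())
            .OrderByDescending(h => h.Score)
            .ToList();
    }
}
```

Note that this only considers versions that actually matched the query: if the latest version of an item does not contain the search term, an older matching version would still be returned, so an extra check against the item's real latest version may be needed.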

Whilst I haven't figured out the exact answer (how to stop Lucene indexing old versions on the master db), I have come up with an acceptable workaround...

When Lucene returns its results from the index, each hit has a field called "_id" which is formatted something like this (3 versions of the same item, where the last number is the version):

"CCB75380-4E9A-4921-99EC-65E532E330FF%en%1"
"CCB75380-4E9A-4921-99EC-65E532E330FF%en%2"
"CCB75380-4E9A-4921-99EC-65E532E330FF%en%3"
...

I'm currently sorting by Sort.RELEVANCE, which is the default. This would be fine if there were only one version of each item in the index, but with several almost identical versions, they all have the same relevance score and Lucene just churns them out in any order. Sitecore then takes the first instance of the item version (even if it's old).

The solution is to specify a secondary sort field. In the searchContext.Search() method, you can pass a custom Sort object.

searchContext.Search(query, new Sort(...));

By sorting by Lucene's built-in Sort.RELEVANCE first, and then by the _id field (descending) in the index, I can ensure that the first hit that Sitecore sees will be the latest version and not just a random one:

searchContext.Search(query, new Sort(
    new SortField[2]
    {
        SortField.FIELD_SCORE, // equivalent to Sort.RELEVANCE
        new SortField("_id", SortField.STRING, true) // sort by _id, descending
    }
));

The SortField parameters are as follows:

SortField(string fieldName, int type, bool reverse)

This approach has fixed my problem, but if anyone can actually find out how to only index the latest version, please answer!
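One caveat worth checking with this workaround: a string sort compares the _id values character by character, so once an item reaches version 10 or higher, "…%en%9" would sort above "…%en%10" in descending order. The sketch below shows the difference on plain strings shaped like the _id values above; IdSortDemo and its methods are hypothetical helper names for this illustration, not Sitecore or Lucene APIs, and it assumes the version is always the part after the last "%":

```csharp
using System;
using System.Linq;

public static class IdSortDemo
{
    // Extracts the trailing version from an "_id" value shaped like "GUID%language%version".
    public static int ParseVersion(string id)
    {
        return int.Parse(id.Substring(id.LastIndexOf('%') + 1));
    }

    // Mimics a descending string sort on _id and returns the first entry.
    public static string FirstByStringSort(string[] ids)
    {
        return ids.OrderByDescending(id => id, StringComparer.Ordinal).First();
    }

    // Sorts by the parsed version number instead and returns the first entry.
    public static string FirstByNumericVersion(string[] ids)
    {
        return ids.OrderByDescending(id => ParseVersion(id)).First();
    }
}
```

If your items can have ten or more versions, picking the latest version by its parsed number (for example when walking the sub-results) is likely to be more reliable than relying on the string order alone.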

I ended up figuring out an alternative solution based on the above answers.

Architecturally speaking, I think the ideal solution for this problem is to filter out the older versions from the results using custom code at a higher level, rather than removing them from the master database index altogether. You don't want to fight the way Sitecore is designed to work just to solve the problem at hand.

Use the predicate below to filter out the older versions and retrieve only the latest version (note that And returns a new predicate, so the result has to be assigned back):

predicate = predicate.And(item => item[Sitecore.ContentSearch.BuiltinFields.LatestVersion].Equals("1"));

Hope this helps someone!
