Date range query using Lucene spatial search / DateRangePrefixTree?

I'm using Lucene 6.3, but I can't figure out what is wrong with the following very basic search query. It simply adds two documents, each with a single date range, and then searches on a larger range that should find both documents. What is wrong?

There are inline comments which should make the example pretty self-explanatory. Let me know if anything is unclear.

Please note that my main requirement is being able to perform date range queries alongside other field queries, such as

text:interesting date:[2014 TO NOW]

This is after watching the Lucene spatial deep dive video introduction, which introduces the framework on which DateRangePrefixTree and the strategies are based.

Rant: It feels like if I am making any mistakes here, I should get some validation errors, either on the query or on the writing, given how simplistic my example is.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.spatial.prefix.NumberRangePrefixTreeStrategy;
import org.apache.lucene.spatial.prefix.PrefixTreeStrategy;
import org.apache.lucene.spatial.prefix.tree.DateRangePrefixTree;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;
import java.util.Calendar;
import java.util.Date;


public class TestLuceneDatePrefix {

  /*
  All these names should be lower case as field names are case sensitive in Lucene.
   */
  private static final String NAME = "name";
  public static final String TIME = "time";


  private Directory directory;
  private StandardAnalyzer analyzer;
  private ScoreDoc lastDocOnPage;
  private IndexWriterConfig indexWriterConfig;

  @Before
  public void setup() {
    analyzer = new StandardAnalyzer();
    directory = new RAMDirectory();
    indexWriterConfig = new IndexWriterConfig(analyzer);
  }


  @Test
  public void testAddDocumentAndSearchByDate() throws IOException {

    IndexWriter w = new IndexWriter(directory, new IndexWriterConfig(analyzer));

    // Responsible for creating the prefix string / geohash / token to identify the date.
    // aka Create post codes
    DateRangePrefixTree prefixTree = new DateRangePrefixTree(DateRangePrefixTree.JAVA_UTIL_TIME_COMPAT_CAL);

    // Strategy indexing the token.
    // aka transform post codes into tokens that make them efficient to search.
    PrefixTreeStrategy strategy = new NumberRangePrefixTreeStrategy(prefixTree, TIME);


    createDocument(w, "Bill", new Date(2017,1,1), prefixTree, strategy);
    createDocument(w, "Ted", new Date(2018,1,1), prefixTree, strategy);

    w.close();

    // The documents are written; now try to query them.

    DirectoryReader reader;
    try {
      QueryParser queryParser = new QueryParser(NAME, analyzer);
      System.out.println(queryParser.getLocale());

      // Surely searching only on year for the easiest case should work?
      Query q = queryParser.parse("time:[1972 TO 4018]");

      // The following query returns 1 result, so Lucene is set up.
      // Query q = queryParser.parse("name:Ted");
      reader = DirectoryReader.open(directory);
      IndexSearcher searcher = new IndexSearcher(reader);

      TotalHitCountCollector totalHitCountCollector = new TotalHitCountCollector();

      int hitsPerPage = 10;
      searcher.search(q, hitsPerPage);

      TopDocs docs = searcher.search(q, hitsPerPage);
      ScoreDoc[] hits = docs.scoreDocs;

      // Hit count is zero and no document printed!!

      // Putting a dependency on mockito would make this code harder to paste and run.
      System.out.println("Hit count : "+hits.length);
      for (int i = 0; i < hits.length; ++i) {
        System.out.println(searcher.doc(hits[i].doc));
      }
      reader.close();
    }
    catch (ParseException e) {
      e.printStackTrace();
    }
  }


  private void createDocument(IndexWriter w, String name, Date fromDate, DateRangePrefixTree prefixTree, PrefixTreeStrategy strategy) throws IOException {
    Document doc = new Document();

    // Store a text/stored field for the name. This helps indicate that Lucene is working.
    doc.add(new TextField(NAME, name, Field.Store.YES));

    //offset toDate
    Calendar cal = Calendar.getInstance();
    cal.setTime( fromDate );
    cal.add( Calendar.DATE, 1 );
    Date toDate = cal.getTime();

    // This lets the prefix tree create whatever tokens it needs
    // perhaps index year, date, second etc separately, hence multiple potential tokens.
    for (IndexableField field : strategy.createIndexableFields(prefixTree.toRangeShape(
        prefixTree.toUnitShape(fromDate), prefixTree.toUnitShape(toDate)))) {
      // Debugging the tokens produced is difficult as I can't intuitively look at them and know if they are valid.
      doc.add(field);
    }
    w.addDocument(doc);
  }
}

Update:

  • I thought maybe the answer was to use SimpleAnalyzer instead of StandardAnalyzer, but this doesn't appear to work either.

  • My requirement of being able to parse user date ranges does seem to be catered for by SOLR, so I would expect this to be based on Lucene functionality.

The QueryParser is not going to be useful for searching on spatial fields, and the analyzer isn't going to make any difference. Analyzers are designed to tokenize and transform text. As such, they aren't used by spatial fields. Similarly, the QueryParser is primarily geared around text searching, and has no support for spatial queries.

You'll need to query using a spatial query. In particular, the subclasses of AbstractPrefixTreeQuery will be useful.

For instance, if I want to query for documents whose time field is a range that contains the years 2003 - 2005, I could create a query like:

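// Calendar months are zero-based, so use the Calendar constants.
Shape queryShape = prefixTree.toRangeShape(
    prefixTree.toUnitShape(new GregorianCalendar(2003, Calendar.JANUARY, 1)),
    prefixTree.toUnitShape(new GregorianCalendar(2005, Calendar.DECEMBER, 31)));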

Query q = new ContainsPrefixTreeQuery(
          queryShape,
          "time",
          prefixTree,
          10,
          false
  );

So this would match a document that had been indexed, for instance, with the range 2000-01-01 to 2006-01-01.

Or to go the other way and match all documents whose ranges fall entirely within the query range:

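// Calendar months are zero-based, so use the Calendar constants.
Shape queryShape = prefixTree.toRangeShape(
    prefixTree.toUnitShape(new GregorianCalendar(1990, Calendar.JANUARY, 1)),
    prefixTree.toUnitShape(new GregorianCalendar(2020, Calendar.DECEMBER, 31)));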

Query q = new WithinPrefixTreeQuery(
          queryShape,
          "time",
          prefixTree,
          10,
          -1,
          -1
  );

Note on arguments: I don't really understand some of the parameters to these queries, particularly detailLevel and prefixGridScanLevel. I haven't found any documentation on how exactly they work. These values seem to work in my basic tests, but I don't know what the best choices would be.
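The two queries above differ only in the direction of the comparison between the indexed range and the query range. The following is a minimal sketch of that difference in plain java.time; it is not how Lucene implements these queries, and DateInterval is a hypothetical class introduced only for illustration.

```java
import java.time.LocalDate;

// Hypothetical helper, not part of Lucene: a closed date interval [start, end].
final class DateInterval {
    final LocalDate start;
    final LocalDate end;

    DateInterval(LocalDate start, LocalDate end) {
        this.start = start;
        this.end = end;
    }

    /** True if this interval fully covers the other one ("contains"). */
    boolean contains(DateInterval other) {
        return !start.isAfter(other.start) && !end.isBefore(other.end);
    }

    /** True if this interval lies entirely inside the other one ("within"). */
    boolean within(DateInterval other) {
        return other.contains(this);
    }
}
```

With an indexed interval of 2000-01-01 to 2006-01-01, contains is true for the 2003 - 2005 query range and within is true for the 1990 - 2020 query range, which mirrors the two examples above.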

Firstly, QueryParser can parse dates and produces a TermRangeQuery by default. See the following method of the default parser, which produces a TermRangeQuery.

org.apache.lucene.queryparser.classic.QueryParserBase#getRangeQuery(java.lang.String, java.lang.String, java.lang.String, boolean, boolean)

This assumes that you'll be storing dates as strings in the Lucene index, which is a little inefficient but works straight out of the box, provided a SimpleAnalyzer or equivalent is used.
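A range over string terms compares them lexicographically, so the string approach depends on the date strings sorting in date order. Here is a minimal sketch, in plain java.time rather than Lucene, of why a fixed-width yyyyMMdd encoding has that property. SortableDateStrings and its methods are hypothetical names, and the encoding is an assumption for illustration, not necessarily the format your index uses.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Lucene.
class SortableDateStrings {

    /** Formats a date as a fixed-width yyyyMMdd string, e.g. 20140101. */
    static String toSortable(LocalDate date) {
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    /** Inclusive range check performed purely on the encoded strings. */
    static boolean inRange(String value, String lower, String upper) {
        return lower.compareTo(value) <= 0 && value.compareTo(upper) <= 0;
    }
}
```

Because every string has the same width and the most significant field comes first, comparing the strings gives the same answer as comparing the dates. A format such as 1/2/2017 would not have this property.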

Alternatively, you can store the dates as LongPoint, which would be the most efficient option for the scenario in my question above, where a date is a point in time and one date is stored per field.

Calendar fromDate = ...
doc.add(new LongPoint(FIELDNAME, fromDate.getTimeInMillis()));

But here, as with the DateRangePrefixTree suggestion, this requires writing hard-coded queries.

Query pointRangeQueryHardCoded = LongPoint.newRangeQuery(FIELDNAME, fromDate.getTimeInMillis(), toDate.getTimeInMillis());

It is possible to reuse QueryParser even here, if the following method is overridden with a version that produces a LongPoint range query.

org.apache.lucene.queryparser.classic.QueryParserBase#newRangeQuery(java.lang.String, java.lang.String, java.lang.String, boolean, boolean)
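The main work in such an override is turning the two range endpoints, which arrive as strings, into the long values that were indexed. Below is a minimal sketch of that conversion only, with the Lucene wiring left out. It assumes the endpoints are plain years such as 2014, that values were indexed as epoch milliseconds in UTC, and that a null endpoint stands for an open end. YearRangeBounds and its methods are hypothetical names.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

// Hypothetical helper, not part of Lucene.
class YearRangeBounds {

    /** First millisecond of the given year, in UTC. */
    static long startOfYearMillis(String year) {
        return LocalDate.of(Integer.parseInt(year.trim()), 1, 1)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
    }

    /** Last millisecond of the given year, in UTC. */
    static long endOfYearMillis(String year) {
        int next = Integer.parseInt(year.trim()) + 1;
        return startOfYearMillis(String.valueOf(next)) - 1;
    }

    /**
     * Converts year endpoints into an inclusive {lower, upper} pair of epoch millis.
     * An exclusive endpoint skips the whole named year; null means unbounded.
     */
    static long[] toInclusiveMillis(String part1, String part2,
                                    boolean startInclusive, boolean endInclusive) {
        long lower = part1 == null ? Long.MIN_VALUE
                : startInclusive ? startOfYearMillis(part1) : endOfYearMillis(part1) + 1;
        long upper = part2 == null ? Long.MAX_VALUE
                : endInclusive ? endOfYearMillis(part2) : startOfYearMillis(part2) - 1;
        return new long[] {lower, upper};
    }
}
```

Inside the overridden method, the resulting pair could then be passed to LongPoint.newRangeQuery(FIELDNAME, bounds[0], bounds[1]), which treats both bounds as inclusive.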

This can also be done for the DateRangePrefixTree version, but this scheme is only worthwhile if:

  • You wanted to search by some unusual token (I believe it could accommodate Mondays).
  • You had multiple dates per document field.
  • You were storing date ranges which needed to be queried over.

Adapting the query parser to have a convenient lingo that captures all relevant scenarios would, I imagine, be a fair amount of work for this last case.

Additionally, please be careful not to mix Date(YEAR, MONTH, DAY) with GregorianCalendar(YEAR, MONTH, DAY), as the arguments are not equivalent and will cause problems.

See java.util.Date#Date(int, int, int) for how different the arguments are and why this constructor is deprecated. This caught me out, as per the code in the question.
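A small sketch of the difference, using only the JDK: the deprecated Date constructor treats its first argument as an offset from 1900, while GregorianCalendar takes the year as written. Both take a zero-based month. DateConstructorPitfall is a hypothetical class name used only for this illustration.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;

// Hypothetical demo class, not part of the question's code.
class DateConstructorPitfall {

    /** Calendar year actually represented by the deprecated new Date(year, month, day). */
    @SuppressWarnings("deprecation")
    static int yearFromDeprecatedDate(int year, int month, int day) {
        Calendar cal = new GregorianCalendar();
        cal.setTime(new Date(year, month, day)); // year is interpreted as 1900 + year
        return cal.get(Calendar.YEAR);
    }

    /** Calendar year represented by new GregorianCalendar(year, month, day). */
    static int yearFromGregorianCalendar(int year, int month, int day) {
        return new GregorianCalendar(year, month, day).get(Calendar.YEAR);
    }

    public static void main(String[] args) {
        System.out.println(yearFromDeprecatedDate(2017, 1, 1));    // prints 3917
        System.out.println(yearFromGregorianCalendar(2017, 1, 1)); // prints 2017
    }
}
```

So new Date(2017,1,1) in the question is 1 February 3917, not 1 January 2017. Writing new GregorianCalendar(2017, Calendar.JANUARY, 1) avoids both the year offset and the zero-based month surprise.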

Thanks again to femtoRgon for pointing out the mechanics of the spatial search, but in the end this wasn't the way for me to go.
