
Java Lucene: Search for terms that include non-alphanumeric characters

I need to return results using TermDocs and Term objects. I get no results when I use StandardAnalyzer. Are there other analyzers that perform the same operations as StandardAnalyzer but still return results for terms such as #define?

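      // Note the spelling: the Lucene class is StandardAnalyzer.
      analyser = new StandardAnalyzer(Version.LUCENE_30);
      reader = IndexReader.open(FSDirectory.open(IndexDir), true);  // true = read-only
      TermDocs td = reader.termDocs();
      QueryParser parserContents = new QueryParser(Version.LUCENE_30, field, analyser);
      query = parserContents.parse(searchTerm);
      docs = search.search(query, 100000);
      ScoreDoc[] documents = docs.scoreDocs;
      for (ScoreDoc match : documents)
      {
          // w is the raw term text (e.g. "#define"); it is not run through the analyzer.
          td.seek(new Term(field, w));
          // skipTo returns false when the term has no posting at or after match.doc.
          if (td.skipTo(match.doc) && td.doc() == match.doc) {
              hits = td.freq();
          } else {
              hits = 0;
          }
      }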

I do get results when I search through QueryParser alone, without TermDocs. In the code above, however, hits is always zero for terms such as #define that contain a special character (#).

The StandardAnalyzer does a lot of pre-processing of tokens (it uses a stop list, removes non-alpha characters, lower-cases, etc.), so that probably accounts for what you're seeing. QueryParser passes the search string through the same analyzer, so it ends up searching for the processed token, whereas a raw Term is looked up exactly as given; if the index holds "define" rather than "#define", the TermDocs lookup finds nothing. Try analyzing the same field with the SimpleAnalyzer or the WhitespaceAnalyzer to see what you get. Note that SimpleAnalyzer also splits at non-letter characters, so WhitespaceAnalyzer is the more likely of the two to keep "#define" intact. That might give you enough experience with the results to know whether one of these analyzers is adequate, or how to build your own that specifies the exact tokenizing operations you need.

You might also want to add more than one field holding the same values, each processed with a different analyzer. That way, for example, you could search for stemmed and unstemmed text, for text with or without the stop words removed, with or without the special characters included, etc.
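To make the difference concrete, here is a minimal plain-Java sketch. It is not Lucene code: the class TokenizerSketch and its two methods are hypothetical names that only approximate a letter-based tokenizer with lower-casing and a whitespace-only tokenizer. StandardAnalyzer's real tokenization is more elaborate, so treat this as an illustration of why the indexed term can differ from the text you typed.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizerSketch {

    // Approximation of letter-based tokenizing plus lower-casing:
    // every non-letter character ends the current token and is dropped.
    public static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Approximation of whitespace tokenizing: split on whitespace only,
    // keeping punctuation and case exactly as written.
    public static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "#define MAX_SIZE 10";
        System.out.println(letterTokens(line));     // prints [define, max, size]
        System.out.println(whitespaceTokens(line)); // prints [#define, MAX_SIZE, 10]
    }
}
```

Under these assumptions, an index built the letter-based way contains the term define, so a lookup with new Term(field, "#define") has nothing to match, while the whitespace variant keeps #define as an indexed term (and also keeps the original case).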

