I receive each vendor's name and address as a string, which I index in Lucene. From another servlet I then receive text that contains vendor names and addresses. For example, "I have a problem using the credit card, xxxxx, in the shop" followed by a vendor name and address, or "my credit card is declined in the shop" followed by a vendor name and address. I remove stop words such as I, the, in, problem, shop, etc., which leaves a clean text containing the remaining terms. I need to find all vendors whose vendor_name or area appears inside the given text.
This is how I index the vendor details. Every line in the file describes one vendor, with the details separated by commas:
FieldType keywordFieldType = new FieldType();
keywordFieldType.setStored(true);
keywordFieldType.setIndexed(true);
keywordFieldType.setTokenized(false);
writer = new IndexWriter(dir, iwc);
BufferedReader reader = new BufferedReader(new FileReader(
        VENDOR_DETAILS));
String line = reader.readLine();
while (line != null) {
    Document document = new Document();
    document.add(new Field("content", line.toLowerCase(),
            keywordFieldType));
    writer.addDocument(document);
    line = reader.readLine();
}
writer.commit();
This is how I search the index:
QueryParser queryParser = new QueryParser(VERSION, "content",
        new WhitespaceAnalyzer(VERSION));
String special = "content:" + stringToQuery.trim();
try {
    if (searcherManager == null) {
        searcherManager = new SearcherManager(
                FSDirectory.open(new File(INDEX_DIRECTORY)),
                new SearcherFactory());
    }
    searcher = searcherManager.acquire();
    TopDocs docs = searcher.search(queryParser.parse(special), 100);
    int hitCount = docs.totalHits;
How do I query Lucene to satisfy the above requirement? What type of Query should I use to find the vendor details inside the given text?
You are adding your documents without tokenization but tokenizing the query, so your analysis at index time does not match your analysis at query time. Since the field appears to be free text, tokenizing it is important for effective searching. Rather than specifying a FieldType at all, I would recommend simply using TextField. You can use WhitespaceAnalyzer for both, as you already do at query time, but I would consider StandardAnalyzer a better starting point.
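To make the mismatch concrete, here is a minimal sketch in plain Java. It is a toy inverted index, not Lucene, and every name in it (AnalysisMismatchDemo, tokenize, buildIndex, search) as well as the two vendor lines is made up for illustration. The untokenized case roughly models the keyword FieldType from the question, and the tokenized case roughly models what a TextField with a matching analyzer would give you.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy inverted index (NOT Lucene) showing why index-time and query-time analysis must agree. */
class AnalysisMismatchDemo {

    /** Stand-in for an analyzer: lowercases and splits on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /**
     * Builds a map from term to the line numbers containing it.
     * With tokenized == false the whole lowercased line becomes a single term,
     * which is how an untokenized (keyword) field behaves.
     */
    static Map<String, Set<Integer>> buildIndex(List<String> lines, boolean tokenized) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int i = 0; i < lines.size(); i++) {
            List<String> terms = tokenized
                    ? tokenize(lines.get(i))
                    : Collections.singletonList(lines.get(i).toLowerCase());
            for (String term : terms) {
                index.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(i);
            }
        }
        return index;
    }

    /** OR-search: returns every line that shares at least one term with the tokenized query. */
    static Set<Integer> search(Map<String, Set<Integer>> index, String query) {
        Set<Integer> hits = new LinkedHashSet<>();
        for (String term : tokenize(query)) {
            hits.addAll(index.getOrDefault(term, Collections.emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical vendor lines; the real file contents are not shown in the question.
        List<String> vendors = Arrays.asList(
                "Acme Stores, Main Street",
                "Globex, Harbour Road");
        // Hypothetical incoming text after stop-word removal.
        String text = "credit card declined acme stores main street";

        // Whole line indexed as one term: no query token can equal it.
        System.out.println(search(buildIndex(vendors, false), text)); // prints []
        // Line tokenized the same way as the query: vendor 0 is found.
        System.out.println(search(buildIndex(vendors, true), text));  // prints [0]
    }
}
```

With the untokenized index, the only terms are the complete lines, so a query made of individual words never matches. Once both sides go through the same tokenization, the individual vendor and address words line up, which is the effect you should see in Lucene after switching to TextField and using the same analyzer for indexing and querying.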