
Deleting a document in Apache Lucene having an exact match

I want to delete a document in Apache Lucene only when there is an exact match. For example, I have documents containing the text:

  Document1: Bilal
  Document2: Bilal Ahmed
  Document3: Bilal Ahmed - 54

When I try to remove a document with the query 'Bilal', it deletes all three documents, while it should delete only the first document, the exact match.

The code I use is this:

    String query = "bilal";
    String field = "userNames";

    Term term = new Term(field, query);

    IndexWriter indexWriter = null;

    File indexDir = new File(indexedDirectory);
    Directory directory = FSDirectory.open(indexDir);

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);

    indexWriter = new IndexWriter(directory, iwc);        

    indexWriter.deleteDocuments(term);
    indexWriter.close();    

This is how I am indexing my documents:

    File indexDir = new File("C:\\Local DB\\TextFiled");
    Directory directory = FSDirectory.open(indexDir);

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);

    // Create the IndexWriter that will add the documents to the index
    IndexWriter indexWriter = new IndexWriter(directory, iwc);

    int i = 0;

    try (DataSource db = DataSource.getInstance()) {

        PreparedStatement ps = db.getPreparedStatement(
                "SELECT user_id, username FROM " + TABLE_NAME + " as au" + User_CONDITION);

        try (ResultSet resultSet = ps.executeQuery()) {

            while (resultSet.next()) {
                i++;
                Document doc = new Document();

                String userID = resultSet.getString("user_id");
                String text = resultSet.getString("username");
                doc.add(new StringField("userNames", text, Field.Store.YES));

                indexWriter.addDocument(doc);
                System.out.println("User Name : " + text + " : " + userID);
            }
        }
    }

    indexWriter.close();

You have not shown how you index those documents. If they are indexed using StandardAnalyzer and tokenization is on, it is understandable that you get these results - StandardAnalyzer splits the text into individual word tokens, and since each of your documents contains Bilal, the delete-by-term hits all of them.
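To see why, here is a small sketch (assuming Lucene 4.6, as in the question; the class name is only for illustration) that prints the tokens StandardAnalyzer produces for "Bilal Ahmed" - the lowercased tokens bilal and ahmed - which is why a delete on the term bilal also hits that document:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class ShowTokens {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
            // Run the analyzer over the field value exactly as indexing of a tokenized field would
            TokenStream ts = analyzer.tokenStream("userNames", new StringReader("Bilal Ahmed"));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // prints "bilal", then "ahmed"
            }
            ts.end();
            ts.close();
        }
    }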

The general advice is that you should always add a unique id field and query/delete by that id field.
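A minimal sketch of that approach (assuming Lucene 4.6; the field names user_id/userNames and the helper methods are illustrative, not from the question): the id is stored in an un-analyzed StringField, so deleting by a Term on that field removes exactly one document.

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    class UserIndexHelper {
        // Index one user with a unique, un-analyzed id field plus the searchable name
        static void indexUser(IndexWriter writer, String userId, String userName) throws IOException {
            Document doc = new Document();
            doc.add(new StringField("user_id", userId, Field.Store.YES));   // exact-match key
            doc.add(new TextField("userNames", userName, Field.Store.YES)); // tokenized for search
            writer.addDocument(doc);
        }

        // Delete exactly the document carrying this id, nothing else
        static void deleteUser(IndexWriter writer, String userId) throws IOException {
            writer.deleteDocuments(new Term("user_id", userId));
            writer.commit();
        }
    }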

If you can't do this, index the same text as a separate field without tokenization and use a phrase query to find the exact match - but this sounds like a horrible hack to me.
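A sketch of that fallback, reusing doc, text and indexWriter from the indexing code above (the field name userNames_exact is an assumption): the un-analyzed copy keeps the whole value as a single token, so instead of a phrase query you can simply delete by a plain Term on it, and only the document whose entire value is "Bilal" is removed.

    // At index time: keep the tokenized field for searching and add an
    // un-analyzed copy of the same text for exact matching.
    doc.add(new TextField("userNames", text, Field.Store.YES));         // tokenized: bilal, ahmed, ...
    doc.add(new StringField("userNames_exact", text, Field.Store.YES)); // whole value as one token

    // At delete time: only "Document1: Bilal" matches, the longer names do not.
    indexWriter.deleteDocuments(new Term("userNames_exact", "Bilal"));
    indexWriter.commit();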
