Lucene $search pipeline in MongoDB Atlas

I can't get the following to work in Atlas using a $search pipeline.

Problem

  • If we search with query = "John", only documents containing "John" are returned.
  • If we search with "John Doe", we get far too many documents: the returned documents are those containing "John" OR "Doe".

We need to be able to search the field "index" with a query like "John Doe" and get only documents containing "John Doe" in the entity field "index".

I have many Mongo entities with the following model (with many names other than John Doe). Here is one of these entities:

{
    "_id" : "1b85cbe3-d0f4-44ee-a9fd-f9b81152891d",
    "aList" : [
        {
            "index" : [
                "John Doe 10001 New York",
                "Jane Doe 10001 New York"
            ]
        }
    ],
    "anotherList" : [
        {
            "index" : [
                "John Doe 10001 New York",
                "John Doe 10001 New York"
            ]
        }
    ]
}

I have created a Lucene index in Atlas with the following JSON:

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "anotherList": {
        "fields": {
          "index": [
            {
              "dynamic": true,
              "type": "document"
            },
            {
              "multi": {
                "frenchAnalyzer": {
                  "analyzer": "lucene.french",
                  "searchAnalyzer": "lucene.french",
                  "type": "string"
                },
                "germanAnalyzer": {
                  "analyzer": "lucene.german",
                  "searchAnalyzer": "lucene.german",
                  "type": "string"
                },
                "italianAnalyzer": {
                  "analyzer": "lucene.italian",
                  "searchAnalyzer": "lucene.italian",
                  "type": "string"
                }
              },
              "type": "string"
            }
          ]
        },
        "type": "document"
      },
      "aList": {
        "fields": {
          "index": [
            {
              "dynamic": true,
              "type": "document"
            },
            {
              "multi": {
                "frenchAnalyzer": {
                  "analyzer": "lucene.french",
                  "searchAnalyzer": "lucene.french",
                  "type": "string"
                },
                "germanAnalyzer": {
                  "analyzer": "lucene.german",
                  "searchAnalyzer": "lucene.german",
                  "type": "string"
                },
                "italianAnalyzer": {
                  "analyzer": "lucene.italian",
                  "searchAnalyzer": "lucene.italian",
                  "type": "string"
                }
              },
              "type": "string"
            }
          ]
        },
        "type": "document"
      }
    }
  }
}

Now, when I run an aggregation searching for "John" with this $search stage, I get

{
  "index": "IndexKundensuche",
  "text": {
    "query": "John",
    "path": [
      {
        "value": "aList.index",
        "multi": "frenchAnalyzer"
      },
      {
        "value": "aList.index",
        "multi": "germanAnalyzer"
      },
      {
        "value": "aList.index",
        "multi": "italianAnalyzer"
      },
      {
        "value": "anotherList.index",
        "multi": "frenchAnalyzer"
      },
      {
        "value": "anotherList.index",
        "multi": "germanAnalyzer"
      },
      {
        "value": "anotherList.index",
        "multi": "italianAnalyzer"
      }
    ]
  }
}

only documents containing "John" back. If we search with query = "John Doe", we get far too many documents: the returned documents are those containing "John" OR "Doe", and they are not ordered.

Sorting is already present, but we do not know how the score is calculated. For example, if we search by postal code, some results get a higher score, but the first document is not the one we expect.
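One way to look at the scores is to project them next to the matched fields. The following is a minimal sketch, not a tested solution: it only builds the pipeline as a plain object, `withScore` is a hypothetical helper name, and the index name and paths are the ones from this question. `{ $meta: "searchScore" }` is the Atlas Search metadata expression for the relevance score.

```javascript
// Sketch only: builds the pipeline as a plain object; nothing here connects to Atlas.
// withScore is a hypothetical helper; index name and paths come from the question.
function withScore(searchStage, limit = 10) {
  return [
    { $search: searchStage },
    { $limit: limit },
    {
      $project: {
        "aList.index": 1,
        "anotherList.index": 1,
        // searchScore is the relevance score Atlas Search assigned to the document
        score: { $meta: "searchScore" },
      },
    },
  ];
}

const scoredPipeline = withScore({
  index: "IndexKundensuche",
  text: { query: "John Doe", path: ["aList.index", "anotherList.index"] },
});

console.log(JSON.stringify(scoredPipeline, null, 2));
```

In mongosh, the resulting array could be passed to aggregate() on the collection; results come back ordered by this score unless a later stage re-sorts them.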

If there is a middle name, for example "Jon-Ben Doe", and we search for "Jon Doe", other results come up with "Doe" before "Jon".

What am I doing wrong? Is this supported by Atlas $search? Is my $search query wrong? Are we forced to split the $search pipeline (search for John, then Doe, and so on)?

Thanks

We found the correct answer:

Atlas Search custom analyzers offer a token filter (icuFolding) that folds away diacritical marks (accents, etc.) and ignores the difference between upper- and lower-case letters.

Create an Atlas Lucene index named 'diacritic_folding_index' (the name is free to choose) using the JSON editor, and select your collection:

{
  "mappings": {
    "fields": {
      "aList": {
        "fields": {
          "index": [
            {
              "analyzer": "diacriticFolder",
              "type": "string"
            }
          ]
        },
        "type": "document"
      },
      "anotherList": {
        "fields": {
          "index": [
            {
              "analyzer": "diacriticFolder",
              "type": "string"
            }
          ]
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "diacriticFolder",
      "tokenFilters": [
        {
          "type": "icuFolding"
        }
      ],
      "tokenizer": {
        "type": "keyword"
      }
    }
  ]
}
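To get a feel for what this analyzer does to a value, here is a rough local approximation. It is an illustration only and not the ICU folding that Atlas Search actually applies (icuFolding covers more cases); `approximateFold` is a hypothetical name. Note that with the keyword tokenizer the whole string, for example "John Doe 10001 New York", is kept as a single token rather than being split into words.

```javascript
// Rough local approximation of diacritic and case folding, for illustration only.
// This is NOT the ICU folding implementation used by Atlas Search.
function approximateFold(text) {
  return text
    .normalize("NFD")        // split letters from their combining accents
    .replace(/\p{M}/gu, "")  // drop the combining marks
    .toLowerCase();          // ignore upper/lower case
}

console.log(approximateFold("Cédric Walter")); // "cedric walter"
console.log(approximateFold("CEDRIC WALTER") === approximateFold("Cédric Walter")); // true
```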

The diacritic-insensitive Lucene search index is then used in a wildcard search (adapt the index name 'diacritic_folding_index' if you changed it above).

Once the index is created (wait for 100%), go to your collection and open the Aggregation tab.

Create a new pipeline, use $search as the first stage, then build your query. For example, paste:

{
    $search: {
        "index": "diacritic_folding_index",
        "wildcard": {
            "path": ["aList.index", "anotherList.index"],
            "query": "Cédric Walter*",
            "allowAnalyzedField": true
        }
    }
}
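Because the keyword tokenizer keeps each value as one token, a query such as "Cédric Walter*" should match values that start with that text. If the name can appear anywhere in the value, a leading * is probably needed as well. The sketch below builds such a stage; `escapeWildcard` and `buildContainsSearch` are hypothetical helpers, and the escaping assumes the wildcard operator's special characters are *, ? and \ (the escape character). Leading wildcards tend to be slower, so treat this as a starting point.

```javascript
// Hypothetical helpers: build a "contains" wildcard stage as a plain object.
// Assumes *, ? and \ are the only special characters of the wildcard operator.
function escapeWildcard(term) {
  return term.replace(/[\\*?]/g, "\\$&"); // prefix each special character with a backslash
}

function buildContainsSearch(term, indexName = "diacritic_folding_index") {
  return {
    $search: {
      index: indexName,
      wildcard: {
        path: ["aList.index", "anotherList.index"],
        query: `*${escapeWildcard(term)}*`, // match the term anywhere in the value
        allowAnalyzedField: true,
      },
    },
  };
}

const containsStage = buildContainsSearch("John Doe");
console.log(containsStage.$search.wildcard.query); // "*John Doe*"
```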

When you are satisfied, use the export button, select Java, .NET, or another language, and paste the result into your code.

I believe you want to use the "phrase" operator described here:

https://docs.atlas.mongodb.com/reference/atlas-search/phrase/

This operator searches for an ordered sequence of terms, whereas the text operator does not consider order.
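As a sketch of how the query from the question might look with phrase instead of text (untested against the asker's data): `buildPhraseSearch` is a hypothetical helper, and the index name, fields, and multi analyzer names are copied from the question. slop is the phrase operator's allowed distance between words; 0 requires the words to be adjacent, and a value of 1 or more might be needed for cases like "Jon-Ben Doe" searched as "Jon Doe".

```javascript
// Hypothetical helper: build a $search stage that uses the phrase operator
// over the same paths and multi analyzers as the question's text query.
function buildPhraseSearch(query, slop = 0, indexName = "IndexKundensuche") {
  const fields = ["aList.index", "anotherList.index"];
  const analyzers = ["frenchAnalyzer", "germanAnalyzer", "italianAnalyzer"];
  // One { value, multi } entry per field/analyzer combination (6 in total)
  const path = fields.flatMap((value) =>
    analyzers.map((multi) => ({ value, multi }))
  );
  return {
    $search: {
      index: indexName,
      phrase: { query, path, slop },
    },
  };
}

const phraseStage = buildPhraseSearch("John Doe");
console.log(JSON.stringify(phraseStage, null, 2));
```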
