
Elasticsearch: aggregate by similar substrings

I have an index of documents with only one property each. The records look like this:

Products Sport
Products Health
Products Home
Questions CSS
Questions HTML
Questions JS

There are a lot of documents and a lot of duplicates. The question is: can I somehow group them by "similarity" (in any sense) and add the "common part" to each document, so that I end up with something like this:

Products Sport         Products
Products Health        Products
Products Home          Products
Questions CSS          Questions
Questions HTML         Questions
Questions JS           Questions

It's just for analysis purposes, so it can be very inaccurate, but should be quick enough.

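What you are looking for is _update_by_query. Run something like the following once per category to add a field named category and set its value with a script. Note that the exists query only matches documents that have a field literally named Products; if the text lives in a single field, as in the question, replace it with a match or prefix query on that field.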

POST index/_update_by_query?conflicts=proceed
{
  "script": {
    "source": "ctx._source['category'] = 'Products'",
    "lang": "painless"
  },
  "query": {
    "exists": {
      "field": "Products"
    }
  }
}
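
If running one request per category is too tedious, the script itself can derive the category. The following is only a rough sketch, assuming the single property is called name (a hypothetical field name, since the question does not give one) and that the "common part" is simply everything before the first space. It builds a request body for the same _update_by_query endpoint; first_token is a plain Python mirror of the Painless logic so the rule can be checked locally before touching the index.

```python
import json

FIELD = "name"  # hypothetical field name; the question does not name the property

# Painless source: copy everything before the first space into `category`.
# If the value contains no space, the whole value is used.
PAINLESS = (
    "String s = ctx._source[params.field];"
    " int i = s.indexOf(' ');"
    " ctx._source.category = i > 0 ? s.substring(0, i) : s;"
)


def first_token(value):
    """Python mirror of the Painless logic above, for local checking only."""
    i = value.find(" ")
    return value[:i] if i > 0 else value


def build_update_body(field=FIELD):
    """Body for: POST <index>/_update_by_query?conflicts=proceed"""
    return {
        "script": {
            "lang": "painless",
            "source": PAINLESS,
            "params": {"field": field},
        },
        # Only touch documents that actually have the field.
        "query": {"exists": {"field": field}},
    }


if __name__ == "__main__":
    # Print the JSON so it can be pasted into the console.
    print(json.dumps(build_update_body(), indent=2))
```

This is deliberately crude: "Products Sport" and "Products Home" both get Products, but anything needing real similarity (shared prefixes longer than one word, typos) would need a different rule.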

Alternative: if you only need to group the results, you can use a query clause such as exists to select the documents of a given type and then run aggregations on them, without updating the documents.
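
One possible shape for that read-only approach is a filters aggregation with one named filter per category. This sketch again assumes a hypothetical text field called name and uses a match query instead of exists, because in the question the category is part of the field value rather than a field of its own. The helper build_group_counts_body is made up for illustration; it only builds the body for a _search request.

```python
import json


def build_group_counts_body(field, categories):
    """Body for: POST <index>/_search

    Returns one bucket per category with its document count,
    without modifying any document.
    """
    return {
        "size": 0,  # no hits needed, only the aggregation buckets
        "aggs": {
            "by_category": {
                "filters": {
                    # One named filter per category.
                    "filters": {
                        category: {"match": {field: category}}
                        for category in categories
                    }
                }
            }
        },
    }


if __name__ == "__main__":
    body = build_group_counts_body("name", ["Products", "Questions"])
    print(json.dumps(body, indent=2))
```

Keep in mind that match on an analyzed text field finds the word anywhere in the value, not only at the start, so a document can land in more than one bucket. For a quick, inaccurate analysis that is probably acceptable.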

