
Invalid results when searching emails using elasticsearch with Tire and Ruby on Rails

I'm trying to index and search by email using Tire and elasticsearch.

The problem is that if I search for "something@example.com", I get strange results because of the @ and . symbols. I "solved" it by hacking the query string and adding "email:" before any term I suspect is an email address. If I don't do that, searching for "something@example.com" returns results such as "something@gmail.com" or "asd@example.com".

include Tire::Model::Search
include Tire::Model::Callbacks

settings :analysis => {
  :analyzer => {
    :whole_email => {
      'tokenizer' => 'uax_url_email'
    }
  }
} do
  mapping do
    indexes :id
    indexes :email, :analyzer => 'whole_email', :boost => 10
  end
end

def self.search(params)
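  # Prefix email-like tokens with "email:" so the query_string query targets the email field.
  # to_s guards against a missing :query param, which would otherwise raise NoMethodError.
  params[:query] = params[:query].to_s.split(" ").map { |x| x =~ EMAIL_REGEXP ? "email:#{x}" : x }.join(" ")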
  tire.search(load: {:include => {'event' => 'organizer'}}, page: params[:page], per_page: params[:per_page] || 10) do
    query do
      boolean do
        must { string params[:query] } if params[:query].present?
        must { term :event_id, params[:event_id]  } if params[:event_id].present?
      end
    end
    sort do
      by :id, 'desc'
    end
  end
end

def to_indexed_json
  self.to_json
end

When searching with "email:" the analyzer works perfectly, but without it the string is searched without the specified analyzer, which returns lots of undesired results.

I think your issue is to do with the _all field. By default, all fields get indexed twice, once under their field name, and again, using a different analyzer, in the _all field.

If you send a query without specifying which field to search, it is executed against the _all field. When you index your doc, the email field's content is indexed again under the _all field (to stop this, set include_in_all: false in your mapping), where it is tokenized the standard way (split on @ and .). This means that queries with no field specified will give strange results.
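To make the effect concrete, here is a rough plain-Ruby simulation, not actual Elasticsearch code. It assumes the _all field splits addresses on @ and . as described above (the real standard tokenizer's rules differ in detail), and the helper names are hypothetical.

```ruby
# Rough stand-in for the default analysis applied to _all:
# lowercase, then split on "@", "." and whitespace (an approximation).
def standard_like_tokens(text)
  text.downcase.split(/[@.\s]+/).reject(&:empty?)
end

# Stand-in for an email-aware analyzer: each address stays a single token.
def whole_email_tokens(text)
  text.downcase.split(/\s+/).reject(&:empty?)
end

# A document matches when it shares at least one token with the query.
def matches?(query, document, tokenizer)
  (send(tokenizer, query) & send(tokenizer, document)).any?
end

query = "something@example.com"
docs  = ["something@example.com", "something@gmail.com", "asd@example.com", "bob@other.org"]

loose  = docs.select { |d| matches?(query, d, :standard_like_tokens) }
strict = docs.select { |d| matches?(query, d, :whole_email_tokens) }

puts "split on @ and .: #{loose.inspect}"
puts "whole address:    #{strict.inspect}"
```

With the split tokens, "something@gmail.com" and "asd@example.com" both share a fragment with the query, which mirrors the unwanted hits from the question. With whole-address tokens only the exact address matches.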

The way I would fix this is to use a term query for the emails and make sure to specify the field to search on. A term query is faster because it skips the query-parsing step that the query_string query has (that parser is why prefixing the string with "email:" sends it to the right field). Also, you don't need a custom analyzer unless the field contains free text mixed with URLs and emails. If the field only contains emails, just set index: not_analyzed and each value will remain a single token. (You might still want a custom analyzer that lowercases the email, though.)

Make your search query like this:

{
  "query": {
    "term": { "email": "example@domain.com" }
  }
}
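As a minimal sketch of how the request body could be assembled on the Ruby side, the helper below sends email-like tokens to term queries on the email field and leaves the remaining words for query_string. It builds a plain hash and does not call Tire. The names build_search_body and email_like? and the regular expression are hypothetical (the question's EMAIL_REGEXP is not shown), and the downcase step assumes emails are also stored lowercased.

```ruby
require 'json'

# Hypothetical check; stands in for the question's EMAIL_REGEXP, which is not shown.
def email_like?(token)
  !!(token =~ /\A[^@\s]+@[^@\s]+\.[^@\s]+\z/)
end

# Build an Elasticsearch request body as a plain Ruby hash.
def build_search_body(raw_query)
  emails, words = raw_query.to_s.split(" ").partition { |t| email_like?(t) }

  # Exact match on the email field (assumes addresses are indexed lowercased).
  must = emails.map { |e| { term: { email: e.downcase } } }

  # Everything else still goes through the query parser.
  must << { query_string: { query: words.join(" ") } } unless words.empty?

  return { query: { match_all: {} } } if must.empty?
  { query: { bool: { must: must } } }
end

body = build_search_body("Something@Example.com conference")
puts JSON.pretty_generate(body)
```

Keeping the email part out of query_string means the address is never re-tokenized by the analyzer used for unqualified queries.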

Good luck!

Add the field to _all and try searching with an escape character (\\) added before the special characters of the email address.

example: something\\@example\\.com
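A minimal Ruby sketch of the escaping this answer suggests. It assumes the doubled backslashes in the example are string-literal escaping for a single backslash, the helper name is hypothetical, and whether escaping changes the search results has not been verified here.

```ruby
# Prefix "@" and "." with a single backslash, as suggested above.
def escape_email_chars(text)
  text.gsub(/[@.]/) { |ch| "\\#{ch}" }
end

escaped = escape_email_chars("something@example.com")
puts escaped
```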

