
Standard analyzer that doesn't tokenize some words/patterns

Suppose there is a log line like this:

> Mar 14 20:22:41 subdomain.mydomain.colo postfix/smtpd[16862]: NOQUEUE:
> reject: RCPT from unknown[1.2.3.4]: 450 4.7.1 Client host rejected:
> cannot find your reverse hostname, [5.6.7.8]; from=<erp@misms.net.in>
> to=<a@domain1.com> proto=ESMTP helo=<a.domain.net> also
> from=<>

There are a few problems with using the standard tokenizer.

  • With the standard tokenizer, I can't search for from=<> .
  • The whitespace tokenizer handles from=<> flawlessly, but then I can't search for an email ID like a@domain1.com or the domain a.domain.net, because the <> characters stay attached to them. I would want a@domain1.com to be one token.
  • With the standard tokenizer, I can search for a@domain1.com, but it is actually indexed as two tokens (which I think is inefficient).
  • The standard tokenizer also breaks up the hostname subdomain.mydomain.colo, which I don't want.

So, is there a way to analyze text with the standard tokenizer while leaving words that match a regex untokenized? I am a newbie in ES, so if possible please give a small example; that would be awesome.

I feel that a regex-based tokenizer can be expensive, so if there is a chance that I can use the whitespace analyzer while also keeping tokens like hostnames and email IDs intact and preserving a few special words, that would be awesome.

Any input you have is appreciated.

PS: I had a look at this post on the ES mailing list, but the approach there won't work for email addresses or hostnames, because I can't have an exhaustive list of all email addresses/hostnames. So, I hope you understand my requirement.

There have been some major changes to StandardAnalyzer in Lucene 4.x. Rather than the old logic, it now implements the UAX#29 word-segmentation rules.

The old-style StandardAnalyzer has been renamed ClassicAnalyzer, which uses a ClassicTokenizer; that tokenizer should do most of what you want (it is designed explicitly to handle e-mail addresses and hostnames as single tokens).
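Assuming your version of Elasticsearch exposes Lucene's ClassicTokenizer as the classic tokenizer type, a minimal sketch of index settings wiring it into a custom analyzer might look like this (the analyzer name myClassicAnalyzer is made up for the example):

    {
        "settings": {
            "analysis": {
                "analyzer": {
                    "myClassicAnalyzer": {
                        "type": "custom",
                        "tokenizer": "classic",
                        "filter": ["lowercase"]
                    }
                }
            }
        }
    }

With this, a@domain1.com and subdomain.mydomain.colo should each come out as a single (lowercased) token.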

However, I don't believe it will help you parse from=<> as a token. For that, I see a couple of options:

  • Change the data: since it's one very specific string you want to recognize, just replace every instance of it with a single token you can easily search for, e.g. replace from=<> with NULLSENDER, then index it (see the sketch after this list).
  • Create a custom tokenizer to handle your grammar. Probably the easiest way, and what the Lucene API recommends, would be to copy the ClassicTokenizer code, and work from there.

Considering the apparent simplicity of the special case you want to handle there, I would probably use the first option, since it should be pretty easy, and the second option might be more trouble than it's worth.
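If you go with the first option, you don't even have to rewrite the raw logs: a pattern_replace char filter can do the substitution at index time. A minimal sketch, assuming the classic tokenizer from above; the names null_sender_marker and myLogAnalyzer are made up for the example:

    {
        "settings": {
            "analysis": {
                "char_filter": {
                    "null_sender_marker": {
                        "type": "pattern_replace",
                        "pattern": "from=<>",
                        "replacement": "NULLSENDER"
                    }
                },
                "analyzer": {
                    "myLogAnalyzer": {
                        "type": "custom",
                        "char_filter": ["null_sender_marker"],
                        "tokenizer": "classic",
                        "filter": ["lowercase"]
                    }
                }
            }
        }
    }

Searching for NULLSENDER through the same analyzer then finds every line that originally contained from=<>.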

I think adding a multi-field mapping with different analyzers would simplify things by creating a separate sub-field for each scenario:

    "myfield": {
            "type": "multi_field",
            "fields": {
                "myfield": {
                    "include_in_all": true,
                    "type": "string",
                    "index": "analyzed",
                    "analyzer": "myWhitespaceAnalyzer"
                },
                "variant1": {
                     "include_in_all": true,
                    "type": "string",
                    "index": "analyzed",
                    "analyzer": "myOtherAnalyzer"
                },
                 "untouched": {
                     "include_in_all": true,
                    "type": "string",
                    "index": "not_analyzed"
                }
            }
        }

Then either search all fields, or target a specific sub-field, depending on your needs.
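For example, to hit all three variants at once you could use a multi_match query (the sub-fields are addressed as myfield.variant1 and so on); this is just a sketch against the mapping above:

    {
        "query": {
            "multi_match": {
                "query": "a@domain1.com",
                "fields": ["myfield", "myfield.variant1", "myfield.untouched"]
            }
        }
    }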
