Suppose there is a log line like this:
> Mar 14 20:22:41 subdomain.mydomain.colo postfix/smtpd[16862]: NOQUEUE:
> reject: RCPT from unknown[1.2.3.4]: 450 4.7.1 Client host rejected:
> cannot find your reverse hostname, [5.6.7.8]; from=<erp@misms.net.in>
> to=<a@domain1.com> proto=ESMTP helo=<a.domain.net> also
> from=<>
There are a few problems with using the standard tokenizer. It does not keep `from=<>` as a single token, nor `a@domain1.com` or the domain `a.domain.net`, because they are wrapped in `<>` characters. I would want `a@domain1.com` to be one token, but it is actually split into two tokens (so I think it's inefficient). So, is there a way to analyze text so that it uses the standard tokenizer but does not tokenize the words that match a regex? I am a newbie in ES, so if possible please give a small example; that would be awesome.
I feel that a regex-based tokenizer can be expensive, so if there is a change I can make to use the whitespace analyzer while also keeping tokens like hostnames and email IDs intact and preserving a few words, that would be awesome.
Please answer with any input you have.
PS: I had a look at this post on the ES mailing list, but it won't work with email addresses or hostnames because I can't have an exhaustive list of all email addresses and hostnames. So, I hope you understand my requirement.
There have been some major changes to StandardAnalyzer in Lucene 4.X. Rather than the old logic, it now implements UAX#29 (Unicode Text Segmentation).
The old style of StandardAnalyzer has been renamed ClassicAnalyzer, which uses a ClassicTokenizer, which should do most of what you want (it is designed explicitly to handle e-mail addresses and hostnames as single tokens).
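As a rough sketch of how the classic behaviour might be wired into an index, the snippet below builds a settings body for a custom analyzer. It assumes your Elasticsearch version exposes Lucene's ClassicTokenizer as the built-in `classic` tokenizer; the analyzer name `mail_log_classic` and the helper function are made up for illustration.

```python
import json


def classic_analysis_settings(analyzer_name="mail_log_classic"):
    """Build an index-settings body with a custom analyzer on the `classic` tokenizer.

    Assumption: the target Elasticsearch version provides a tokenizer named
    "classic". The analyzer name is a hypothetical placeholder.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": "classic",
                        # Lowercasing is optional; drop it to keep case.
                        "filter": ["lowercase"],
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body you would send when creating the index.
    print(json.dumps(classic_analysis_settings(), indent=2))
```

The resulting analyzer name could then be referenced from the `analyzer` key of a field mapping.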
However, I don't believe it will help you parse `from=<>` as a token. For that, I see a couple of options, one of which is to replace `from=<>` with `NULLSENDER`, then index it.
Considering the apparent simplicity of the special case you want to handle there, I would probably use the first option, since it should be pretty easy, and the second option might be more trouble than it's worth.
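The replacement step could look roughly like the following, run on each log line before it is sent to the index. This is only a sketch: the function name and the choice to require whitespace (or start of line) before `from=<>` are assumptions, not part of any Elasticsearch API.

```python
import re

# Placeholder token that stands in for an empty envelope sender.
NULL_SENDER_TOKEN = "NULLSENDER"

# Match "from=<>" only at the start of the line or after whitespace,
# so that something like "xfrom=<>" is left alone.
_EMPTY_FROM = re.compile(r"(?<!\S)from=<>")


def mark_null_sender(line: str) -> str:
    """Replace every empty-sender marker in a log line with the placeholder."""
    return _EMPTY_FROM.sub(NULL_SENDER_TOKEN, line)


if __name__ == "__main__":
    sample = "proto=ESMTP helo=<a.domain.net> from=<>"
    print(mark_null_sender(sample))
```

A non-empty sender such as `from=<erp@misms.net.in>` is left unchanged, so only the special case is rewritten.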
I think adding a `multi_field` mapping with different analyzers would simplify it by creating separate cases to handle the different scenarios:

    "myfield": {
        "type": "multi_field",
        "fields": {
            "myfield": {
                "include_in_all": true,
                "type": "string",
                "index": "analyzed",
                "analyzer": "myWhitespaceAnalyzer"
            },
            "variant1": {
                "include_in_all": true,
                "type": "string",
                "index": "analyzed",
                "analyzer": "myOtherAnalyzer"
            },
            "untouched": {
                "include_in_all": true,
                "type": "string",
                "index": "not_analyzed"
            }
        }
    }

Then search either across all fields or on a specific field, depending on your needs.
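For illustration, targeting one of the variants might look like the sketch below, which builds a `match` query body. It assumes sub-fields of a `multi_field` are addressed with dotted names such as `myfield.untouched`; the helper name is hypothetical.

```python
import json


def subfield_match_query(text, subfield=None, base_field="myfield"):
    """Build a match-query body against the base field or one of its sub-fields.

    Assumption: sub-fields are addressed as "<base_field>.<subfield>".
    """
    field = base_field if subfield is None else f"{base_field}.{subfield}"
    return {"query": {"match": {field: text}}}


if __name__ == "__main__":
    # Exact lookup against the not_analyzed variant.
    print(json.dumps(subfield_match_query("a@domain1.com", "untouched"), indent=2))
    # Analyzed lookup against the default variant.
    print(json.dumps(subfield_match_query("domain1.com"), indent=2))
```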