
Java Lucene StandardAnalyzer's default delimiters?

I am looking for all the delimiters on which the Java Lucene StandardAnalyzer tokenizes an input string.

I need to know every delimiter that is used by default for tokenizing.

I know (from Lucene in Action) that every character that is not a-z or A-Z, or a variation of those letters with diacritics, is used as a delimiter, and that includes digits.
So "Mc'Donald" might be split into "Mc" and "Donald", "Web2.0" might be tokenized as just "Web", and so on.
The best approach is to run a test with all kinds of characters and then post your results here.
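
As a baseline for such a test, here is a minimal sketch of what the rule described above would predict. It is not Lucene's tokenizer: `LetterOnlyTokenizerSketch` and `tokenize` are hypothetical names, the code uses only the JDK, and it assumes "letter" means any Unicode letter (`\p{L}`), so everything else, digits included, is treated as a delimiter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: models the rule as stated in the answer,
// NOT the behavior of Lucene's real StandardAnalyzer.
class LetterOnlyTokenizerSketch {

    // Splits on every run of characters that are not Unicode letters,
    // so digits, apostrophes, dots and whitespace all act as delimiters.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String part : input.split("\\P{L}+")) {
            // split() can yield an empty first element when the input
            // starts with a delimiter; skip it.
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] samples = {"Mc'Donald", "Web2.0", "café au lait", "foo_bar-baz"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + tokenize(sample));
        }
    }
}
```

To see what a given Lucene release really does, feed the same strings to `StandardAnalyzer` through `tokenStream(...)` and read each term from its `CharTermAttribute`. The real analyzer also lowercases terms, and its splitting rules depend on the Lucene version, so its output may well differ from this model.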
