Lucene search for Japanese characters

Original post: 2010-04-15 07:17:17 3 3 c#/ asp.net/ lucene.net

I have implemented Lucene for my application, and it works very well until something like Japanese characters is introduced.

The problem is this: if I have the Japanese string こんにちは、このバイネイです and I search with こ, which is the first character, it works well. If I use more than one Japanese character (こんにち) in the search token, the search fails and no document is found.

Are Japanese characters supported in Lucene? What settings are needed to get it working?

3 answers

Lucene's built-in analyzer does not support Japanese.

You need to install an analyzer such as Sen, which is a Java port of MeCab, a quite popular Japanese analyzer, and it is fast.

There are two analyzer types:

CJKAnalyzer, which also supports Chinese and Korean, and uses a bi-gram method
JapaneseAnalyzer, which supports only Japanese, uses a morphological analyzer, and is supposed to be very fast.
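
To make the bi-gram method concrete, here is a minimal Python sketch of the idea. It is not Lucene's CJKAnalyzer and does not call any Lucene API. The function names `cjk_bigrams` and `matches` are hypothetical, and splitting on whitespace and the punctuation marks 、 and 。 is a simplifying assumption.

```python
import re


def cjk_bigrams(text):
    """Split text into overlapping two-character tokens (bi-grams)."""
    tokens = []
    # Assumption: runs of text are separated by whitespace or 、 / 。 only.
    for run in re.split(r"[\s、。]+", text):
        if len(run) == 1:
            # A lone character cannot form a pair, so keep it as-is.
            tokens.append(run)
        tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


def matches(document, query):
    """True if every bi-gram of the query also occurs in the document."""
    doc_tokens = set(cjk_bigrams(document))
    return all(token in doc_tokens for token in cjk_bigrams(query))


document = "こんにちは、このバイネイです"
print(cjk_bigrams("こんにち"))      # ['こん', 'んに', 'にち']
print(matches(document, "こんにち"))  # True
```

Because both the document and the query are cut into the same overlapping pairs, a multi-character query can match without any knowledge of Japanese word boundaries.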

I don't think there can be an analyzer that will work for all languages. The problem is that different languages have different rules about word boundaries and stemming (for example, Thai does not use spaces at all to separate words). Or if there is, I certainly wouldn't want to be the maintainer!

What you will need to do is "tag" blocks of text as one language or another and use the correct analyzer for that particular language. You can attempt to detect the language "automatically" by doing character analysis (i.e., text that predominantly uses Japanese katakana is likely Japanese).
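
The character analysis described above could be sketched roughly as follows, in Python. It counts characters in the Unicode Hiragana and Katakana blocks (U+3040 to U+30FF). The function names, the labels "japanese" and "default", and the 0.3 threshold are all illustrative assumptions, not part of Lucene.

```python
def kana_ratio(text):
    """Fraction of non-whitespace characters that are hiragana or katakana."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # Hiragana (U+3040-U+309F) and Katakana (U+30A0-U+30FF) are adjacent blocks.
    kana = sum(1 for c in chars if "\u3040" <= c <= "\u30ff")
    return kana / len(chars)


def guess_analyzer(text, threshold=0.3):
    """Return a hypothetical analyzer label for a block of text."""
    # The threshold is an arbitrary assumption and would need tuning on real data.
    return "japanese" if kana_ratio(text) >= threshold else "default"


print(guess_analyzer("こんにちは、このバイネイです"))  # japanese
print(guess_analyzer("Lucene works well"))              # default
```

A heuristic like this misses Japanese text written entirely in kanji, since those characters are shared with Chinese, so it is only a first pass.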

You should use the new Japanese analyzers recently released in Lucene 3.6.0. They are based on the excellent Kuromoji morphological analyzer recently donated to Lucene in LUCENE-3305.

Docs are a bit sparse as of this writing, so here are a few more links…

If you use Solr, here's a sample schema that will work on Websolr.
Slides from my presentation at the 20 Apr 2012 herokujp meetup, on full-text search with an emphasis on analyzing Japanese.