
How to Troubleshoot Mecab Parser Dysfunction

BACKGROUND: I have built a custom search engine that works fine in English but fails in Japanese, even though my hosting provider has confirmed that I installed the Japanese mecab parser correctly. My own checks reveal the following:

1) SHOW CREATE TABLE :

FULLTEXT KEY search_newsletter ( letter_title , letter_abstract , letter_body ) /*!50100 WITH PARSER mecab */ ) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=latin1

2) SHOW PLUGINS :

ngram | ACTIVE | FTPARSER | NULL | GPL
mecab | ACTIVE | FTPARSER | libpluginmecab.so | GPL

IMPLEMENTATION

1) MYSQL Statement :

$sql = "SELECT letter_no, letter_lang, letter_title, letter_abstract, submission_date, revision_date,
               MATCH (letter_title, letter_abstract, letter_body) AGAINST (? IN NATURAL LANGUAGE MODE) AS letter_score
        FROM sevengates_letter
        WHERE MATCH (letter_title, letter_abstract, letter_body) AGAINST (? IN NATURAL LANGUAGE MODE)
        ORDER BY letter_score DESC";

2) CUSTOM SEARCH ENGINE :

See under Local Search / Newsletters at https://www.grammarcaptive.com/overview.html

3) DOCUMENT SEARCHED :

See under Regular Updates / Newsletter / Archives / Japanese at https://www.grammarcaptive.com/overview.html

COMMENT: Neither PHP nor MySQL complains. A search for any Japanese word that has to be parsed simply returns nothing. For example, the word 日本語 can be searched for and found, but retrieving it does not require any parsing. A search for any other Japanese word in the newsletter fails.

REQUEST : Any troubleshooting tips would be greatly appreciated.

Roddy

A couple of things you can check:

Does Mecab work on the command line?

You should be able to do something like this, assuming a Linux-like system:

echo "日本語ですよ" | mecab

Output should be roughly like this (details will probably differ):

日本    名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*
語      名詞,普通名詞,一般,*,*,*,ゴ,語,語,ゴ,語,ゴ,漢,*,*,*,*
です    助動詞,*,*,*,助動詞-デス,終止形-一般,デス,です,です,デス,です,デス,和,*,*,*,*
よ      助詞,終助詞,*,*,*,*,ヨ,よ,よ,ヨ,よ,ヨ,和,*,*,*,*

On some platforms mecab is statically linked into MySQL, so you don't need a separate system installation, but the docs indicate that's not always the case.
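If you have no shell access on the host, a rough equivalent of the command-line check can be run from inside MySQL. The following is a minimal sketch, assuming a MySQL version that ships the mecab plugin (5.7 or later) and an account allowed to read these values; what it returns depends entirely on your server.

```sql
-- Is the plugin loaded, and from which library?
SELECT plugin_name, plugin_status, plugin_library
FROM information_schema.plugins
WHERE plugin_name = 'mecab';

-- Path of the mecabrc configuration file the plugin was started with
SHOW VARIABLES LIKE 'mecab_rc_file';

-- Character set of the dictionary that mecab actually loaded
SHOW STATUS LIKE 'mecab_charset';
```

If mecab_charset reports something other than the character set of your indexed columns, the parser and the data probably disagree about encoding, which would be consistent with searches silently returning nothing.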

Are your encoding settings correct?

The default character set of your table is latin1, which cannot represent Japanese text. I would suggest using utf8 (in MySQL, preferably utf8mb4), and you'll need to check that your mecab installation and its dictionary support that encoding.
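As a sketch of what the conversion might look like, assuming the table name sevengates_letter from the question and that utf8mb4 is the target: back up the table first, because if the latin1 columns already hold UTF-8 bytes written over a latin1 connection, a direct conversion can double-encode them.

```sql
-- See which character sets the server and the current connection use
SHOW VARIABLES LIKE 'character_set_%';

-- Make sure this session sends and receives utf8mb4
SET NAMES utf8mb4;

-- Convert the table default and every text column in one step;
-- the FULLTEXT index is rebuilt as part of the ALTER
ALTER TABLE sevengates_letter CONVERT TO CHARACTER SET utf8mb4;
```

The PHP connection would need the same character set (for example through the charset option of the driver you use), otherwise the search term itself may arrive at the server in the wrong encoding.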

Hope that helps.

It turns out that the entire table must use a suitable character set, not just the individual columns. At least, this was the one significant change I made when I rebuilt the table.
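For anyone wanting to confirm the same thing on their own table, the table-level default and the per-column character sets can be compared directly. This is a sketch that assumes the table and column names from the question and that the current database is the one holding the table.

```sql
-- Table-level default collation (its prefix is the default character set)
SELECT table_name, table_collation
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name = 'sevengates_letter';

-- Character set of each column covered by the FULLTEXT index
SELECT column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = DATABASE()
  AND table_name = 'sevengates_letter'
  AND column_name IN ('letter_title', 'letter_abstract', 'letter_body');
```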

Even so, the parser does not appear in the phpMyAdmin table section where parsers are apparently supposed to appear. This is perhaps due to the way the parser appears in the table's SHOW CREATE TABLE output. In any case, this is a small shortcoming compared with the parser's overall functionality.

Roddy
