
Word Segmentation using ICU

I am using ICU4C to transliterate CJK. I am wondering whether it is possible to have word segmentation in ICU, to split Chinese text into a sequence of words, defined according to some word segmentation standard.

When I try transliterating for example:



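UErrorCode err = U_ZERO_ERROR;
Transliterator* myTrans =
                  Transliterator::createInstance("zh-Latin", UTRANS_FORWARD, err);
UnicodeString str;  // holds the Chinese input text
myTrans->transliterate(str);  // converts str in place
std::cout << str << std::endl;  // operator<< comes from <unicode/ustream.h>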

I get the following output:

zhí jiē shū chū html dài mǎ ér bù shì zuò wèi hán shù fǎn huí zhí dài hòu chù lǐ

It looks perfectly fine when checked against online pinyin tools, but my problem is that ICU transliterates the characters one by one. What I'm looking for, though, is something more like the text below (I don't know any Chinese, so the text below probably doesn't mean anything, but it should demonstrate what kind of output I'm interested in):

zhíjiē shūchū html dàimǎér bùshì zuò wèihán shùfǎn huízhídài hòu chùlǐ

I have been told that ICU 50 is capable of word segmentation, but I couldn't find any documentation on their website or elsewhere on the web. I'd like to know whether any of you have worked with word segmentation in ICU or know how to do it, or if you have a good link on how to do so.

"Dictionary Based Iterator" isn't a different API. Just create an ICU word break iterator with the appropriate locale ID.

There's a C/C++ sample that comes with ICU in icu/source/samples/break

Also the following sample code shows word breaking: http://source.icu-project.org/repos/icu/icuapps/trunk/iucsamples/c/s24_brkw/s24_brkw.cpp http://source.icu-project.org/repos/icu/icuapps/trunk/iucsamples/c/s23_brki/

Probably something like this:

  UErrorCode status = U_ZERO_ERROR;
  BreakIterator *wordIterator = BreakIterator::createWordInstance(Locale("zh"), status);
  UnicodeString text = "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.";
  int32_t breakCount = 0;
  if (U_SUCCESS(status)) {
    wordIterator->setText(text);  // attach the text to segment
    int32_t start = wordIterator->first();
    for (int32_t end = wordIterator->next();
         end != BreakIterator::DONE;
         start = end, end = wordIterator->next()) {
      // [start, end) is one segment; spaces and punctuation are segments too
      ++breakCount;
    }
  }
  delete wordIterator;

This is the reply I got from ICU's mailing list:

"There's a brand new online demo in progress also, that does the segmentation and splits your text as the following - when Chinese is selected. hope this helps."


This would solve my problem: I need to transliterate this segmented output to get what I am looking for.
