
How to compare ICU iterator values?

I am writing a KMP substring search algorithm for Unicode strings in C using UCharIterator. The problem I am facing is that I need to compare the values the iterators point at, and the comparison should be normalized, whereas all of the ICU collation functions take strings and not individual characters.

UCharIterator first_iter, second_iter;

uiter_setUTF8(&first_iter, needle_str, n_needle_bytes);
uiter_setUTF8(&second_iter, needle_str, n_needle_bytes);

...
if (first_iter.current(&first_iter) != second_iter.current(&second_iter)) {
    ...

The condition above treats 'a' and 'ä' as different, which I don't want it to. I don't like the idea of pre-normalization, as it requires O(n + m) additional memory (to the best of my knowledge, ICU doesn't have a function to do it in place).
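To make the requirement concrete, here is a minimal sketch of a KMP failure table built over code-point spans with a pluggable comparison. It is not ICU code: `span_eq_fn`, `utf8_index`, `kmp_failure` and `bytes_eq` are hypothetical names introduced for illustration, the length helper assumes well-formed UTF-8, and `bytes_eq` is only a byte-wise stand-in for the normalized comparison being asked about.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical callback: nonzero when two single-code-point UTF-8 spans
 * should count as equal. KMP needs this to be an equivalence relation. */
typedef int (*span_eq_fn)(const char *a, int32_t a_len,
                          const char *b, int32_t b_len, void *ctx);

/* Byte-wise stand-in comparator (assumption: swap in a normalized one). */
static int bytes_eq(const char *a, int32_t a_len,
                    const char *b, int32_t b_len, void *ctx) {
    (void)ctx;
    return a_len == b_len && memcmp(a, b, (size_t)a_len) == 0;
}

/* Length of a UTF-8 sequence from its lead byte; 1 for anything unexpected. */
static int32_t utf8_seq_len(uint8_t lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Records the byte offset of every code point in starts[], followed by the
 * end offset. starts[] needs room for n_bytes + 1 entries.
 * Returns the number of code points. */
static int32_t utf8_index(const char *s, int32_t n_bytes, int32_t *starts) {
    int32_t n = 0, i = 0;
    while (i < n_bytes) {
        int32_t len = utf8_seq_len((uint8_t)s[i]);
        if (i + len > n_bytes) len = n_bytes - i; /* truncated tail */
        starts[n++] = i;
        i += len;
    }
    starts[n] = n_bytes;
    return n;
}

/* Classic KMP failure table, indexed by code point rather than by byte. */
static void kmp_failure(const char *needle, const int32_t *starts,
                        int32_t n_cp, span_eq_fn eq, void *ctx,
                        int32_t *fail) {
    int32_t k = 0;
    if (n_cp > 0) fail[0] = 0;
    for (int32_t i = 1; i < n_cp; i++) {
        const char *cur = needle + starts[i];
        int32_t cur_len = starts[i + 1] - starts[i];
        while (k > 0 && !eq(cur, cur_len, needle + starts[k],
                            starts[k + 1] - starts[k], ctx))
            k = fail[k - 1];
        if (eq(cur, cur_len, needle + starts[k],
               starts[k + 1] - starts[k], ctx))
            k++;
        fail[i] = k;
    }
}
```

The offsets array costs O(m) integers for the needle, the same order as the failure table itself, and the haystack can still be walked without copying or normalizing it.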

I had to switch to the U8_* macros for UTF-8 strings in ICU. I advanced the offset with U8_NEXT:

U8_NEXT((uint8_t *)string, string_offset, string_size, status);

And compared like this:

U8_GET((uint8_t *)key, 0,  first_key_end, key_size,  first_key_c);
U8_GET((uint8_t *)key, 0, second_key_end, key_size, second_key_c);
if (coll->cmp(key +  first_key_end, U8_LENGTH(first_key_c),
              key + second_key_end, U8_LENGTH(second_key_c),
              coll)

That is, the byte length of a single letter is calculated with U8_LENGTH from its first code point (and not from an offset or a part of the string). More on that here: https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utf8_8h.html

