How to test a string for containing emoji characters?

I'm using ICU4C and trying to find the grapheme clusters in a UTF-8 string that are emojis. This is the closest I've gotten so far, but it incorrectly classifies the plain character '#' as an emoji: because '#️⃣' begins with '#', the character is "potentially" part of an emoji and therefore carries the property UCHAR_EMOJI.

I think the best approach would be to test for the property RGI_Emoji as indicated here, but that is a "string" property rather than a "codepoint" property, and I don't know how to query it. If I could, I would treat every cluster as a "string" and test it for that property. The documentation states that regular expressions currently cannot match "string" properties.

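#include <unicode/brkiter.h>
#include <unicode/locid.h>
#include <unicode/uchar.h>
#include <unicode/unistr.h>
#include <iostream>
#include <memory>
#include <string>

int main()
{
    const std::string s8 = "#🤙🏿asd🧔🏼😵‍💫dds🫥😶‍🌫️🏌️‍♂️🇨🇦ds#️⃣🏋🏽ds👨‍👩‍👦‍👦ds👩🏾‍❤️‍💋‍👨🏼ds";
    const icu::UnicodeString us = icu::UnicodeString::fromUTF8(s8);
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> bi(icu::BreakIterator::createCharacterInstance(icu::Locale::getUS(), status));
    if(U_FAILURE(status))
        return 1;
    bi->setText(us);
    int emojis = 0;
    int characters = 0;
    int32_t b = bi->first();
    for(int32_t e = bi->next(); e != icu::BreakIterator::DONE; b = e, e = bi->next())
    {
        // Analyze the grapheme cluster [b, e) for emoji-ness.
        bool is_emoji = false;
        for(int32_t i = b; i < e; )
        {
            const UChar32 c = us.char32At(i);
            std::cout << c << ' ';
            // Accumulate over all code points instead of keeping only the last result.
            // Known flaw: a lone '#' carries UCHAR_EMOJI, so it is still counted.
            is_emoji = is_emoji || u_hasBinaryProperty(c, UCHAR_EMOJI) || u_hasBinaryProperty(c, UCHAR_EMOJI_COMPONENT);
            i += U16_LENGTH(c); // step by code point, not by UTF-16 code unit
        }
        std::cout << (is_emoji ? "<- is emoji\n" : "<- is not emoji\n");
        if(is_emoji)
            ++emojis;
        ++characters;
    }
    std::cout << emojis << " emoji(s) in " << characters << " character(s)\n";
    return 0;
}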

It looks like u_stringHasBinaryProperty will give you access to UCHAR_RGI_EMOJI. Note that this function is only available in ICU 70 and later.
