I'm using ICU4C and trying to find the grapheme clusters in a UTF-8 string that are emojis. This is the closest I've gotten so far, but it incorrectly classifies the plain character '#' as an emoji: because '#️⃣' begins with '#', '#' is "potentially" an emoji and so it carries the property UCHAR_EMOJI.
I think the best approach would be to test the property RGI_Emoji, as indicated here, but that is a "string" property rather than a "code point" property, and I don't know how to query it. If I could, I would treat each cluster as a "string" and test it for that string property. The documentation states that regular expressions currently cannot be used to match "string" properties.
const std::string s8 = "#🤙🏿asd🧔🏼😵💫dds🫥😶🌫️🏌️♂️🇨🇦ds#️⃣🏋🏽ds👨👩👦👦ds👩🏾❤️💋👨🏼ds";
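// Needs <unicode/brkiter.h>, <unicode/uchar.h>, <unicode/unistr.h>, <iostream>, <memory> and <stdexcept>.
const icu::UnicodeString us = icu::UnicodeString::fromUTF8(s8);
UErrorCode status = U_ZERO_ERROR;
std::unique_ptr<icu::BreakIterator> bi(icu::BreakIterator::createCharacterInstance(icu::Locale::getUS(), status));
if(U_FAILURE(status))
{
    throw std::runtime_error(u_errorName(status));
}
bi->setText(us);
int emojis = 0;
int characters = 0;
// b and e delimit one grapheme cluster, measured in UTF-16 code units.
for(int32_t b = bi->first(), e = bi->next(); e != icu::BreakIterator::DONE; b = e, e = bi->next())
{
    // Analyze character for emoji-ness.
    bool is_emoji = false;
    // Step by code point rather than by code unit, so surrogate pairs are visited once.
    for(int32_t i = b; i < e; i = us.moveIndex32(i, 1))
    {
        const UChar32 c = us.char32At(i);
        std::cout << c << ' ';
        // Still too permissive: a lone '#' carries UCHAR_EMOJI.
        is_emoji = is_emoji || u_hasBinaryProperty(c, UProperty::UCHAR_EMOJI) || u_hasBinaryProperty(c, UProperty::UCHAR_EMOJI_COMPONENT);
    }
    if(is_emoji)
    {
        std::cout << "<- is emoji\n";
        ++emojis;
    }
    else
    {
        std::cout << "<- is not emoji\n";
    }
    ++characters;
}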
It looks like u_stringHasBinaryProperty will give you access to UCHAR_RGI_EMOJI. Note that this function is not available in ICU versions before 70.