Retrieve Unicode code points > U+FFFF from QChar

Question

I have an application that is supposed to deal with all kinds of characters and at some point display information about them. I use Qt and its inherent Unicode support in QChar, QString etc.

Now I need the code point of a QChar in order to look up some data in http://unicode.org/Public/UNIDATA/UnicodeData.txt, but QChar's unicode() method only returns a ushort (unsigned short), which is a number from 0 to 65535 (0xFFFF). There are characters with code points > 0xFFFF, so how do I get these? Is there some trick I am missing, or is this currently not supported by Qt/QChar?

Answer 1

Each QChar is a single UTF-16 code unit, not a complete Unicode code point. Therefore, a non-BMP character consists of two QChars that together form a surrogate pair.
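To make the surrogate-pair arithmetic concrete, here is a minimal sketch in plain C++ that works on a std::u16string instead of a QString so it compiles without Qt. The helper names (isHighSurrogate, isLowSurrogate, codePointAt) are hypothetical and chosen for illustration; in Qt the corresponding building blocks are QChar::isHighSurrogate(), QChar::isLowSurrogate() and QChar::surrogateToUcs4(), and QString::toUcs4() converts a whole string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration; they are not part of Qt.
inline bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool isLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Returns the code point that starts at index i of a UTF-16 sequence.
// If s[i] does not start a valid surrogate pair, the code unit itself is
// returned (this includes lone surrogates, which are passed through as-is).
inline char32_t codePointAt(const std::u16string &s, std::size_t i)
{
    assert(i < s.size());
    const char16_t first = s[i];
    if (isHighSurrogate(first) && i + 1 < s.size() && isLowSurrogate(s[i + 1])) {
        // Each surrogate carries 10 bits of (code point - 0x10000).
        return 0x10000
             + ((static_cast<char32_t>(first) - 0xD800) << 10)
             + (static_cast<char32_t>(s[i + 1]) - 0xDC00);
    }
    return first;
}
```

Called as codePointAt(text, 0) on a string holding one non-BMP character, this combines both code units into a single value that can be looked up in UnicodeData.txt.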

Answer 2

Unicode characters beyond U+FFFF in Qt

QChar itself only supports Unicode characters up to U+FFFF.

QString supports Unicode characters beyond U+FFFF by storing each of them as two consecutive QChars (that is, as a UTF-16 surrogate pair). However, the QString API doesn't help you much if you need to process characters beyond U+FFFF. As an example, a QString instance which contains the single Unicode character U+131F6 will return a size of 2, not 1.

I opened QTBUG-18868 about this problem back in 2011, but after more than three years (!) of discussion it was closed as "out of scope" without any resolution.
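The size mismatch can be reproduced without Qt. The following is a small sketch using std::u16string, which stores UTF-16 code units just like QString does; countCodePoints is a hypothetical helper name, and the function assumes well-formed UTF-16 (every low surrogate is preceded by a high surrogate).

```cpp
#include <cstddef>
#include <string>

// Counts code points in a well-formed UTF-16 string by skipping the
// second half (low surrogate, 0xDC00..0xDFFF) of every surrogate pair.
// Hypothetical helper for illustration; not part of Qt.
inline std::size_t countCodePoints(const std::u16string &s)
{
    std::size_t count = 0;
    for (char16_t unit : s) {
        if (unit < 0xDC00 || unit > 0xDFFF)
            ++count;
    }
    return count;
}
```

For a string holding only U+131F6, size() reports the two code units while countCodePoints() reports the one character the user actually sees.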

Solution

You can, however, download and use these Unicode Qt string wrapper classes, which have been attached to the Qt bug report. They are licensed under the LGPL.

This download contains the wrapper classes QUtfString, QUtfChar, QUtfRegExp and QUtfStringList, which supplement the existing Qt classes and allow you to do things like this:

QUtfString str;
str.append(0x1307C);            // Some Unicode character beyond U+FFFF

Q_ASSERT(str.size() == 1);
Q_ASSERT(str[0] == 0x1307C);

str += 'a';

Q_ASSERT(str.size() == 2);
Q_ASSERT(str[1] == 'a');
Q_ASSERT(str.indexOf('a') == 1);

For further details about the implementation, usage and runtime complexity, please see the API documentation included in the download.

Answer 3

The solution appears to lie in an API that is documented but not seen much on the Web. Start from the Unicode code point as an integer (not its UTF-8 encoding). You then call QChar::requiresSurrogates() to determine whether a single QChar is large enough to hold it. In this case it is not, so you need to create two QChars.

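#include <cstdint>
#include <QChar>
#include <QString>

uint32_t cp = 155222; // U+25E56, a CJK ideograph outside the BMP (4 bytes in UTF-8)
QString str;
if (QChar::requiresSurrogates(cp))
{
    // Encode the code point as a UTF-16 surrogate pair.
    QChar charArray[2];
    charArray[0] = QChar::highSurrogate(cp);
    charArray[1] = QChar::lowSurrogate(cp);
    str = QString(charArray, 2);
}
else
{
    str = QString(QChar(cp)); // fits in a single QChar
}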

The resulting QString will contain the correct surrogate pair to display your supplementary (non-BMP) character.
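For readers curious what the high and low surrogate values actually are, here is a Qt-free sketch of the standard UTF-16 encoding arithmetic. The function names requiresSurrogates, highSurrogate and lowSurrogate are reused here for readability only; these are stand-alone illustrations under the UTF-16 definition, not Qt's implementation.

```cpp
#include <cassert>
#include <cstdint>

// Stand-alone illustrations of the UTF-16 encoding rules; not Qt code.
inline bool requiresSurrogates(std::uint32_t cp) { return cp >= 0x10000; }

// Upper 10 bits of (cp - 0x10000), offset into the range 0xD800..0xDBFF.
inline std::uint16_t highSurrogate(std::uint32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    return static_cast<std::uint16_t>(0xD800 + ((cp - 0x10000) >> 10));
}

// Lower 10 bits of (cp - 0x10000), offset into the range 0xDC00..0xDFFF.
inline std::uint16_t lowSurrogate(std::uint32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    return static_cast<std::uint16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF));
}
```

For the code point 155222 (0x25E56) used above, this arithmetic yields the pair 0xD857, 0xDE56.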

solution1
6 ACCEPTED 2011-08-07 12:43:57

solution2
2 2014-04-04 10:28:23

solution3
1 2017-04-21 16:47:06
