I use Scintilla and set it's encoding to utf8 (and this is the only way to make it compatible with Unicode characters, if I understand it correctly). With this set up, when talking about a positions in the text Scintilla means byte positions.
The problem is, I use UnicodeString in the rest of my program, and when I need to select a particular rang in the Scintilla editor, I need to convert from char pos of the UnicodeString to byte pos in a utf8 string that's corresponding to the UnicodeString. How can I do that easily? Thanks.
PS, when I found ByteToCharIndex I thought it's what I need, however, according to its document and the result of my testing, it only works If the system uses a multi-byte character system (MBCS).
You should parse UTF8 strings yourself using UTF8 description . I have written a quick UTF8 analog of ByteToCharIndex
and tested on cyrillic string:
function UTF8PosToCharIndex(const S: UTF8String; Index: Integer): Integer;
var
I: Integer;
P: PAnsiChar;
begin
Result:= 0;
if (Index <= 0) or (Index > Length(S)) then Exit;
I:= 1;
P:= PAnsiChar(S);
while I <= Index do begin
if Ord(P^) and $C0 <> $80 then Inc(Result);
Inc(I);
Inc(P);
end;
end;
const TestStr: UTF8String = 'abФЫВА';
procedure TForm1.Button2Click(Sender: TObject);
begin
ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 1))); // a = 1
ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 2))); // b = 2
ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 3))); // Ф = 3
ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 5))); // Ы = 4
ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 7))); // В = 5
end;
The reverse function is no problem too:
function CharIndexToUTF8Pos(const S: UTF8String; Index: Integer): Integer;
var
P: PAnsiChar;
begin
Result:= 0;
P:= PAnsiChar(S);
while (Result < Length(S)) and (Index > 0) do begin
Inc(Result);
if Ord(P^) and $C0 <> $80 then Dec(Index);
Inc(P);
end;
if Index <> 0 then Result:= 0; // char index not found
end;
I wrote a function based on Serg's code with great respect, I posted it here as a separate answer with the hope that it's helpful to others too. Serg's answer is accepted instead.
{Return the index (1-based) of the first byte of the character (unicode point) specified by aCharIdx (1-based) in aUtf8Str.
Code is amended by Edwin Yip based on code written by SO member Serg ( https://stackoverflow.com/users/246408/serg )
ref 1: https://stackoverflow.com/a/10388131/133516
ref 2: http://sergworks.wordpress.com/2012/05/01/parsing-utf8-strings/ }
function CharPosToUTF8BytePos(const aUtf8Str: UTF8String; const aCharIdx:
Integer): Integer;
var
p: PAnsiChar;
charCount: Integer;
begin
p:= PAnsiChar(aUtf8Str);
Result:= 0;
charCount:= 0;
while (Result < Length(aUtf8Str)) do
begin
if IsUTF8LeadChar(p^) then
Inc(charCount);
if charCount = aCharIdx then
Exit(Result + 1);
Inc(p);
Inc(Result);
end;
end;
Both UTF-8 and UTF-16 (what UnicodeString
uses) are variable-length encodings. A given Unicode codepoint can be encoded in UTF-8 using between 1-4 single-byte codeunits, and in UTF-16 using either 1 or 2 2-byte codeunits, depending on the codepoint's numeric value. The only way to translate a position in a UTF-16 string into a position in an equivilent UTF-8 string is to decode the UTF-16 codeunits preceeding the position back to their original Unicode codepoint values and then re-encode them to UTF-8 codeunits.
It sounds like you are better off re-writting the code that interacts with Scintilla to use UTF8String
instead of UnicodeString
, then you won't have to translate between UTF-8 and UTF-16 at that layer anymore. When interacting with the rest of your code, you can convert between UTF8String
and UnicodeString
as needed.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.