Convert char pos of UnicodeString to byte pos in a utf8 string

Question

I use Scintilla with its encoding set to UTF-8 (as far as I understand, this is the only way to make it handle Unicode characters). With this setup, every text position that Scintilla reports or accepts is a byte position.

The problem is that the rest of my program uses UnicodeString, so when I need to select a particular range in the Scintilla editor, I have to convert a character position in the UnicodeString to a byte position in the corresponding UTF-8 string. How can I do that easily? Thanks.

PS: when I found ByteToCharIndex I thought it was what I needed. However, according to its documentation and my own testing, it only works if the system uses a multi-byte character set (MBCS).

Answer 1

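You should parse the UTF-8 string yourself, following the UTF-8 encoding description. I have written a quick UTF-8 analog of ByteToCharIndex and tested it on a Cyrillic string. It counts every byte that is not a continuation byte (a byte of the form 10xxxxxx), because each such byte starts a new character: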

function UTF8PosToCharIndex(const S: UTF8String; Index: Integer): Integer;
var
  I: Integer;
  P: PAnsiChar;

begin
  Result:= 0;
  if (Index <= 0) or (Index > Length(S)) then Exit;
  I:= 1;
  P:= PAnsiChar(S);
  while I <= Index do begin
    if Ord(P^) and $C0 <> $80 then Inc(Result);
    Inc(I);
    Inc(P);
  end;
end;

const TestStr: UTF8String = 'abФЫВА';

procedure TForm1.Button2Click(Sender: TObject);
begin
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 1))); // a = 1
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 2))); // b = 2
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 3))); // Ф = 3
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 5))); // Ы = 4
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 7))); // В = 5
end;

The reverse function is just as simple:

function CharIndexToUTF8Pos(const S: UTF8String; Index: Integer): Integer;
var
  P: PAnsiChar;

begin
  Result:= 0;
  P:= PAnsiChar(S);
  while (Result < Length(S)) and (Index > 0) do begin
    Inc(Result);
    if Ord(P^) and $C0 <> $80 then Dec(Index);
    Inc(P);
  end;
  if Index <> 0 then Result:= 0;  // char index not found
end;
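
For the Scintilla use case in the question, the reverse function above could be wrapped into a range conversion. This is only a sketch: CharRangeToByteRange is a hypothetical helper, not part of any library, and it assumes that Scintilla positions are 0-based byte offsets, with the end of a selection pointing just past the last selected byte. Keep in mind that the function counts code points, which matches UnicodeString indexes only as long as the text contains no surrogate pairs.

```pascal
program SelectRangeSketch;

{$APPTYPE CONSOLE}

// Copied from the answer above so that the sketch compiles on its own.
function CharIndexToUTF8Pos(const S: UTF8String; Index: Integer): Integer;
var
  P: PAnsiChar;
begin
  Result := 0;
  P := PAnsiChar(S);
  while (Result < Length(S)) and (Index > 0) do
  begin
    Inc(Result);
    if Ord(P^) and $C0 <> $80 then Dec(Index);
    Inc(P);
  end;
  if Index <> 0 then Result := 0;  // char index not found
end;

// Hypothetical helper: maps the 1-based character range FirstChar..LastChar
// to a 0-based byte range. StartByte is the offset of the first selected byte,
// EndByte is the offset just past the last selected byte.
function CharRangeToByteRange(const S: UTF8String; FirstChar, LastChar: Integer;
  out StartByte, EndByte: Integer): Boolean;
var
  First, Next: Integer;
begin
  StartByte := 0;
  EndByte := 0;
  First := CharIndexToUTF8Pos(S, FirstChar);
  Result := (First > 0) and (LastChar >= FirstChar);
  if not Result then Exit;
  // The range ends where the character after LastChar begins;
  // if there is no such character, it ends at the end of the text.
  Next := CharIndexToUTF8Pos(S, LastChar + 1);
  if Next = 0 then Next := Length(S) + 1;
  StartByte := First - 1;
  EndByte := Next - 1;
end;

var
  U: UnicodeString;
  S: UTF8String;
  StartByte, EndByte: Integer;
begin
  U := 'ab'#$0424#$042B#$0412#$0410;  // same text as TestStr in the answer
  S := UTF8String(U);                  // converts UTF-16 to UTF-8
  if CharRangeToByteRange(S, 3, 4, StartByte, EndByte) then
    Writeln(StartByte, ' ', EndByte);  // prints "2 6": characters 3..4 occupy bytes 2..5
  // The two values could then be passed as anchor and caret to SCI_SETSEL.
end.
```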

Answer 2

With great respect to Serg, I wrote a function based on his code and am posting it as a separate answer in the hope that it helps others too. Serg's answer remains the accepted one.

{ Returns the 1-based index of the first byte of the character (Unicode code point)
  specified by aCharIdx (1-based) in aUtf8Str.
  Code amended by Edwin Yip, based on code written by SO member Serg
  ( https://stackoverflow.com/users/246408/serg )
  ref 1: https://stackoverflow.com/a/10388131/133516
  ref 2: http://sergworks.wordpress.com/2012/05/01/parsing-utf8-strings/ }

function CharPosToUTF8BytePos(const aUtf8Str: UTF8String;
  const aCharIdx: Integer): Integer;
var
  p: PAnsiChar;
  charCount: Integer;
begin
  p:= PAnsiChar(aUtf8Str);
  Result:= 0;
  charCount:= 0;
  while (Result < Length(aUtf8Str)) do
  begin
    if IsUTF8LeadChar(p^) then
      Inc(charCount);

    if charCount = aCharIdx then
      Exit(Result + 1);

    Inc(p);
    Inc(Result);
  end;
  Result:= 0;  // aCharIdx is out of range: character not found
end;

Answer 3

Both UTF-8 and UTF-16 (which UnicodeString uses) are variable-length encodings. A given Unicode code point is encoded in UTF-8 as 1 to 4 single-byte code units, and in UTF-16 as either 1 or 2 two-byte code units, depending on the code point's numeric value. The only way to translate a position in a UTF-16 string into a position in an equivalent UTF-8 string is to decode the UTF-16 code units preceding the position back to their original Unicode code point values and then re-encode them as UTF-8 code units.
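
A minimal sketch of that walk is shown here. Utf16IndexToUtf8Offset is a hypothetical name, not an RTL function. It does not build the UTF-8 bytes; it only adds up how many bytes each preceding code point would need, and it returns a 0-based byte offset. It assumes an unpaired surrogate would be encoded as a 3-byte replacement character, which may not match every converter.

```pascal
program Utf16ToUtf8Offset;

{$APPTYPE CONSOLE}

// Hypothetical helper: returns the 0-based byte offset, in the UTF-8 encoding
// of S, of the UTF-16 code unit at the 1-based index CharPos.
// If CharPos points at the second half of a surrogate pair, the offset of the
// byte just after the whole pair is returned.
function Utf16IndexToUtf8Offset(const S: UnicodeString; CharPos: Integer): Integer;
var
  I: Integer;
  W: Word;
begin
  Result := 0;
  I := 1;
  while (I < CharPos) and (I <= Length(S)) do
  begin
    W := Ord(S[I]);
    if W < $80 then
      Inc(Result)        // U+0000..U+007F: 1 byte
    else if W < $800 then
      Inc(Result, 2)     // U+0080..U+07FF: 2 bytes
    else if (W >= $D800) and (W <= $DBFF) and (I < Length(S)) and
            (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
    begin
      Inc(Result, 4);    // surrogate pair: one code point, 4 bytes
      Inc(I);            // skip the low surrogate
    end
    else
      Inc(Result, 3);    // rest of the BMP (assumed for lone surrogates too): 3 bytes
    Inc(I);
  end;
end;

begin
  // 'a' and 'b' take 1 byte each, the Cyrillic letter before index 4 takes 2
  Writeln(Utf16IndexToUtf8Offset('ab'#$0424#$042B#$0412#$0410, 4));  // prints 4
  // U+1F600 is one surrogate pair in UTF-16 and 4 bytes in UTF-8
  Writeln(Utf16IndexToUtf8Offset(#$D83D#$DE00'x', 3));               // prints 4
end.
```

Unlike counting lead bytes in the UTF-8 string, this version takes UTF-16 indexes directly, so it stays correct when the text contains characters outside the Basic Multilingual Plane.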

It sounds like you would be better off rewriting the code that interacts with Scintilla to use UTF8String instead of UnicodeString; then you won't have to translate between UTF-8 and UTF-16 positions at that layer anymore. When interacting with the rest of your code, you can convert between UTF8String and UnicodeString as needed.
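
As an illustration of converting at the boundary, the sketch below uses the RTL conversions available in Delphi 2009 and later (a UTF8String cast, UTF8Encode and UTF8ToString). PrefixByteLength is a hypothetical helper for the case where a position still has to cross the boundary: it encodes the text before the position and takes its length. It allocates a temporary string on every call, and it assumes that CharPos does not fall in the middle of a surrogate pair.

```pascal
program Utf8Boundary;

{$APPTYPE CONSOLE}

// Hypothetical helper: 0-based byte offset of the 1-based character CharPos,
// found by encoding the prefix that precedes it.
function PrefixByteLength(const S: UnicodeString; CharPos: Integer): Integer;
begin
  Result := Length(UTF8Encode(Copy(S, 1, CharPos - 1)));
end;

var
  U: UnicodeString;
  B: UTF8String;
begin
  U := 'ab'#$0424#$042B#$0412#$0410;
  B := UTF8String(U);               // UTF-16 to UTF-8, for the Scintilla layer
  Writeln(Length(U));               // 6 UTF-16 code units
  Writeln(Length(B));               // 10 bytes
  Writeln(PrefixByteLength(U, 4));  // 4 bytes precede the fourth character
  Writeln(UTF8ToString(B) = U);     // converting back restores the original text
end.
```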

solution1
3 ACCEPTED 2012-04-30 17:46:37

solution2
1 2012-05-01 05:16:11

solution3
0 2012-04-30 17:31:39
