
Convert char pos of UnicodeString to byte pos in a utf8 string

I use Scintilla and have set its encoding to UTF-8 (as far as I understand, this is the only way to make it compatible with Unicode characters). With this setup, when Scintilla talks about positions in the text, it means byte positions.

The problem is that I use UnicodeString in the rest of my program, and when I need to select a particular range in the Scintilla editor, I have to convert a char position in the UnicodeString to a byte position in the corresponding UTF-8 string. How can I do that easily? Thanks.

PS: when I found ByteToCharIndex I thought it was what I needed. However, according to its documentation and my own testing, it only works if the system uses a multi-byte character set (MBCS).

You should parse the UTF-8 string yourself, following the UTF-8 encoding rules. I have written a quick UTF-8 analog of ByteToCharIndex and tested it on a Cyrillic string:

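// Returns the 1-based index of the character that contains byte number
// Index (1-based) of S, or 0 if Index is out of range.
function UTF8PosToCharIndex(const S: UTF8String; Index: Integer): Integer;
var
  I: Integer;
  P: PAnsiChar;
begin
  Result:= 0;
  if (Index <= 0) or (Index > Length(S)) then Exit;
  I:= 1;
  P:= PAnsiChar(S);
  while I <= Index do begin
    // Continuation bytes look like 10xxxxxx; any other byte starts a character.
    if (Ord(P^) and $C0) <> $80 then Inc(Result);
    Inc(I);
    Inc(P);
  end;
end;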

const TestStr: UTF8String = 'abФЫВА';

procedure TForm1.Button2Click(Sender: TObject);
begin
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 1))); // a = 1
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 2))); // b = 2
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 3))); // Ф = 3
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 5))); // Ы = 4
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 7))); // В = 5
end;

The reverse function is straightforward too:

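// Returns the 1-based position of the first byte of character number
// Index (1-based) in S, or 0 if there is no such character.
function CharIndexToUTF8Pos(const S: UTF8String; Index: Integer): Integer;
var
  P: PAnsiChar;
begin
  Result:= 0;
  P:= PAnsiChar(S);
  while (Result < Length(S)) and (Index > 0) do begin
    Inc(Result);
    // Count down once for every byte that starts a character.
    if (Ord(P^) and $C0) <> $80 then Dec(Index);
    Inc(P);
  end;
  if Index <> 0 then Result:= 0;  // char index not found
end;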

With great respect to Serg, I wrote a function based on his code. I am posting it here as a separate answer in the hope that it is helpful to others too; Serg's answer remains the accepted one.

{ Returns the index (1-based) of the first byte of the character (Unicode code point) specified by aCharIdx (1-based) in aUtf8Str.
  Code amended by Edwin Yip, based on code written by SO member Serg (https://stackoverflow.com/users/246408/serg).
  ref 1: https://stackoverflow.com/a/10388131/133516
  ref 2: http://sergworks.wordpress.com/2012/05/01/parsing-utf8-strings/ }

function CharPosToUTF8BytePos(const aUtf8Str: UTF8String; const aCharIdx:
    Integer): Integer;
var
  p: PAnsiChar;
  charCount: Integer;
begin
  p:= PAnsiChar(aUtf8Str);
  Result:= 0;
  charCount:= 0;
  while (Result < Length(aUtf8Str)) do
  begin
    // Every lead byte starts a new character.
    if IsUTF8LeadChar(p^) then
      Inc(charCount);

    if charCount = aCharIdx then
      Exit(Result + 1);

    Inc(p);
    Inc(Result);
  end;
  Result:= 0;  // char index not found
end;

Both UTF-8 and UTF-16 (which is what UnicodeString uses) are variable-length encodings. A given Unicode code point is encoded in UTF-8 using between 1 and 4 single-byte code units, and in UTF-16 using either 1 or 2 two-byte code units, depending on the code point's numeric value. The only way to translate a position in a UTF-16 string into a position in an equivalent UTF-8 string is to decode the UTF-16 code units preceding the position back to their original Unicode code point values and then re-encode them as UTF-8 code units.
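The decode-and-re-encode step described above does not need to build the UTF-8 string; it is enough to add up how many bytes each preceding code point would occupy. The following is a minimal sketch of that idea, not code from any of the answers. The function name UTF16IndexToUTF8Pos is hypothetical, and counting an unpaired surrogate as 3 bytes is an assumption that may not match what a particular encoder does with malformed input.

```pascal
// Hypothetical helper: returns the 1-based byte position, in the UTF-8
// encoding of S, of the character that starts at the 1-based UTF-16
// code unit index CharIdx. Returns 0 if CharIdx is out of range.
function UTF16IndexToUTF8Pos(const S: UnicodeString; CharIdx: Integer): Integer;
var
  I: Integer;
  C: Word;
begin
  Result := 0;
  if (CharIdx < 1) or (CharIdx > Length(S)) then Exit;
  Result := 1;
  I := 1;
  while I < CharIdx do
  begin
    C := Ord(S[I]);
    if C < $80 then
      Inc(Result)                  // U+0000..U+007F: 1 byte
    else if C < $800 then
      Inc(Result, 2)               // U+0080..U+07FF: 2 bytes
    else if (C >= $D800) and (C <= $DBFF) and (I < Length(S)) and
            (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
    begin
      Inc(Result, 4);              // surrogate pair: one code point, 4 bytes
      Inc(I);                      // skip the low surrogate
    end
    else
      Inc(Result, 3);              // rest of the BMP (assumption: also lone surrogates)
    Inc(I);
  end;
end;
```

For the test string 'abФЫВА' used earlier, UTF16IndexToUTF8Pos(S, 4) gives 5, which agrees with Serg's CharIndexToUTF8Pos on the UTF-8 version of the same string. If CharIdx points at the second half of a surrogate pair, the sketch returns the position just after that pair. Scintilla positions are zero-based, so subtract 1 before passing the result to the editor.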

It sounds like you would be better off rewriting the code that interacts with Scintilla to use UTF8String instead of UnicodeString, so that you no longer have to translate between UTF-8 and UTF-16 positions at that layer. When interacting with the rest of your code, you can convert between UTF8String and UnicodeString as needed.
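As a rough illustration of converting at that boundary, the sketch below assumes Delphi 2009 or later, where casting between UnicodeString and UTF8String transcodes the text. The procedure name ConversionDemo is hypothetical and exists only to hold the example.

```pascal
// Hypothetical demo: round-trips a string between UTF-16 and UTF-8.
procedure ConversionDemo;
var
  Wide: UnicodeString;
  Utf8: UTF8String;
begin
  Wide := 'abФЫВА';
  Utf8 := UTF8String(Wide);      // UTF-16 -> UTF-8
  Assert(Length(Wide) = 6);      // six UTF-16 code units
  Assert(Length(Utf8) = 10);     // two 1-byte and four 2-byte characters
  Wide := UnicodeString(Utf8);   // UTF-8 -> UTF-16
  Assert(Wide = 'abФЫВА');
end;
```

Length on a UTF8String counts bytes, so positions computed on Utf8 are already in the unit Scintilla expects, apart from Scintilla's positions being zero-based.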
