Detecting and Retrieving codepoints and surrogates from a Delphi String

I am trying to better understand surrogate pairs and Unicode implementation in Delphi.

If I call length() on the Unicode string S := 'Ĥà̲V̂e' in Delphi, I will get back 8.

This is because the lengths of the individual characters [Ĥ], [à̲], [V̂], and [e] are 2, 3, 2, and 1 respectively. This is because Ĥ has a surrogate, à̲ has two additional surrogates, V̂ has a surrogate and e has no surrogates.

If I wanted to return the second element in the string including all surrogates, [à̲], how would I do that? I know I would need to do some sort of testing of the individual bytes. I ran some tests using the routine

function GetFirstCodepointSize(const S: UTF8String): Integer;  

referenced in this SO Question,

but got some unusual results, e.g., here are the lengths and sizes of some different codepoints. Below is a snippet of how I generated these tables.

...
UTFCRUDResultStrings.add('INPUT: '+#9#9+ DATA +#9#9+ 'GetFirstCodePointSize = ' +intToStr(GetFirstCodepointSize(DATA))
+#9#9+ 'Length =' + intToStr(length(DATA)));
...

First set: This makes sense to me, each code point size is doubled, but these are one character each and Delphi gives me the length as just 1, perfect.

INPUT:      ď       GetFirstCodePointSize = 2       Length =1
INPUT:      ơ       GetFirstCodePointSize = 2       Length =1
INPUT:      ǥ       GetFirstCodePointSize = 2       Length =1

Second set: It initially looks to me like the lengths and code points are reversed? I am guessing the reason for this is that the characters + surrogates are being treated individually, hence the first codepoint size is for the 'H', which is 1, but the length is returning the lengths of 'H' plus '^'.

INPUT:      Ĥ      GetFirstCodePointSize = 1       Length =2
INPUT:      à̲     GetFirstCodePointSize = 1       Length =3
INPUT:      V̂      GetFirstCodePointSize = 1       Length =2
INPUT:      e       GetFirstCodePointSize = 1       Length =1

Some additional tests...

INPUT:      ¼       GetFirstCodePointSize = 2       Length =1
INPUT:      ₧       GetFirstCodePointSize = 3       Length =1
INPUT:      𤭢      GetFirstCodePointSize = 4       Length =2
INPUT:      ß       GetFirstCodePointSize = 2       Length =1
INPUT:      𨳒      GetFirstCodePointSize = 4       Length =2

Is there a reliable way in Delphi to determine where an element in a Unicode String starts and ends?

I know my terminology using the word element may be off, but I don't think codepoint and character are right either, particularly given that one element may have a codepoint size of 3, but have a length of only one.

I am trying to better understand surrogate pairs and Unicode implementation in Delphi.

Let's get some terminology out of the way.

Each "character" (known as a grapheme) that is defined by Unicode is assigned a unique codepoint.

In a Unicode Transformation Format (UTF) encoding - UTF-7, UTF-8, UTF-16, and UTF-32 - each codepoint is encoded as a sequence of codeunits. The size of each codeunit is determined by the encoding - 7 bits for UTF-7, 8 bits for UTF-8, 16 bits for UTF-16, and 32 bits for UTF-32 (hence their names).

In Delphi 2009 and later, String is an alias for UnicodeString, and Char is an alias for WideChar. WideChar is 16 bits. A UnicodeString holds a UTF-16 encoded string (in earlier versions of Delphi, the equivalent string type was WideString), and each WideChar is a UTF-16 codeunit.

In UTF-16, a codepoint can be encoded using either 1 or 2 codeunits. 1 codeunit can encode codepoint values in the Basic Multilingual Plane (BMP) range - $0000 to $FFFF, inclusive. Higher codepoints require 2 codeunits, which is also known as a surrogate pair.
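To make the 1-or-2 codeunit rule concrete, here is a small sketch of the arithmetic behind a surrogate pair. The names Utf16UnitCount and ToSurrogatePair are made up for this illustration and are not RTL routines; the sample value $24B62 is the codepoint of 𤭢 from the question's third table.

```pascal
program SurrogateSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: codeunits needed for one codepoint in UTF-16.
function Utf16UnitCount(CodePoint: Cardinal): Integer;
begin
  if CodePoint <= $FFFF then
    Result := 1   // inside the BMP
  else
    Result := 2;  // needs a surrogate pair
end;

// Hypothetical helper: splits a codepoint above $FFFF into its two codeunits.
procedure ToSurrogatePair(CodePoint: Cardinal; out HiUnit, LoUnit: Word);
begin
  Dec(CodePoint, $10000);                          // leaves a 20-bit value
  HiUnit := Word($D800 or (CodePoint shr 10));     // top 10 bits
  LoUnit := Word($DC00 or (CodePoint and $3FF));   // bottom 10 bits
end;

var
  HiUnit, LoUnit: Word;
begin
  Writeln(Utf16UnitCount($00DF));   // 1
  Writeln(Utf16UnitCount($24B62));  // 2
  ToSurrogatePair($24B62, HiUnit, LoUnit);
  Writeln(IntToHex(HiUnit, 4), ' ', IntToHex(LoUnit, 4)); // D852 DF62
end.
```

This is why Length() reports 2 for 𤭢 even though it is a single codepoint.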

If I call length() on the Unicode string S := 'Ĥà̲V̂e' in Delphi, I will get back 8.

This is because the lengths of the individual characters [Ĥ], [à̲], [V̂], and [e] are 2, 3, 2, and 1 respectively.

This is because Ĥ has a surrogate, à̲ has two additional surrogates, V̂ has a surrogate and e has no surrogates.

Yes, there are 8 WideChar elements (codeunits) in your UTF-16 UnicodeString. What you are calling "surrogates" are actually known as "combining marks". Each combining mark is its own unique codepoint, and thus its own codeunit sequence.
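A quick way to see those 8 codeunits is to dump them. The sketch below spells the string out with escapes so it does not depend on the source file's encoding. It assumes the decomposition H + U+0302, a + U+0300 + U+0332, V + U+0302, e, which matches the reported lengths but cannot be confirmed from the post (the two marks on the "a" could be stored in the other order). DumpCodeUnits is a name invented for this example, and 1-based string indexing is assumed.

```pascal
program CodeUnitDump;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: lists every UTF-16 codeunit of S as four hex digits.
function DumpCodeUnits(const S: string): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
  begin
    if I > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(S[I]), 4);
  end;
end;

var
  S: string;
begin
  // Assumed spelling of the question's string (see the note above)
  S := 'H'#$0302'a'#$0300#$0332'V'#$0302'e';
  Writeln(Length(S));        // 8
  Writeln(DumpCodeUnits(S)); // 0048 0302 0061 0300 0332 0056 0302 0065
end.
```

None of the values falls in the surrogate range $D800..$DFFF, so the string contains no surrogates at all, only base letters and combining marks.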

If I wanted to return the second element in the string including all surrogates, [à̲], how would I do that?

You have to start at the beginning of the UnicodeString and analyze each WideChar until you find one that is not a combining mark attached to a previous WideChar. On Windows, the easiest way to do that is to use the CharNextW() function, e.g.:

uses
  Windows;

var
  S: String;
  P: PChar;
begin
  S := 'Ĥà̲V̂e';
  // CharNext() resolves to CharNextW() when String is UnicodeString
  P := CharNext(PChar(S)); // returns a pointer to à̲
end;

The Delphi RTL does not have an equivalent function. You would have to write one manually, or use a third-party library. The RTL does have a StrNextChar() function, but it only handles UTF-16 surrogates, not combining marks (CharNext() handles both). So, you could use StrNextChar() to scan through each codepoint in the UnicodeString, but you have to look at each codepoint to know whether it is a combining mark or not, e.g.:

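uses
  SysUtils, Character;

function MyCharNext(P: PChar): PChar;
begin
  if (P <> nil) and (P^ <> #0) then
  begin
    // step over one codepoint (1 or 2 codeunits)
    Result := StrNextChar(P);
    // then over any marks attached to it; U+0300, U+0302 and U+0332 are
    // non-spacing marks, which ucCombiningMark alone does not cover
    while GetUnicodeCategory(Result^) in
      [ucCombiningMark, ucEnclosingMark, ucNonSpacingMark] do
      Result := StrNextChar(Result);
  end else begin
    Result := nil;
  end;
end;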

var
  S: String;
  P: PChar;
begin
  S := 'Ĥà̲V̂e';
  P := MyCharNext(PChar(S)); // should return a pointer to  à̲
end;

I know I would need to do some sort of testing of the individual bytes.

Not the bytes, but the codepoints that they represent when decoded.

I ran some tests using the routine

function GetFirstCodepointSize(const S: UTF8String): Integer

Look closely at that function signature. See the parameter type? It is a UTF-8 string, not a UTF-16 string. This was even stated in the answer you got that function from:

Here is an example how to parse UTF8 string

UTF-8 and UTF-16 are very different encodings, and thus have different semantics. You cannot use UTF-8 semantics to process a UTF-16 string, and vice versa.
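The numbers in the question's tables follow from that difference: Length() counts UTF-16 codeunits, while a UTF8String parameter makes the compiler convert the text to UTF-8 first, so the routine counts bytes. A rough sketch of the comparison, using the RTL's UTF8Encode(); Utf8ByteCount and Report are names invented for this example, and the inputs are written with escapes.

```pascal
program Utf8VersusUtf16;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: byte count of S once converted to UTF-8.
function Utf8ByteCount(const S: string): Integer;
begin
  Result := Length(UTF8Encode(S));
end;

// Hypothetical helper: prints both sizes side by side.
procedure Report(const Name, S: string);
begin
  Writeln(Name, ': UTF-16 codeunits = ', Length(S),
    ', UTF-8 bytes = ', Utf8ByteCount(S));
end;

begin
  Report('U+010F', #$010F);         // 1 codeunit, 2 bytes
  Report('H + U+0302', 'H'#$0302);  // 2 codeunits, 3 bytes (1 for H, 2 for the mark)
  Report('U+20A7', #$20A7);         // 1 codeunit, 3 bytes
  Report('U+24B62', #$D852#$DF62);  // 2 codeunits, 4 bytes
end.
```

For the second case, GetFirstCodePointSize reported 1 because the first codepoint in the UTF-8 data is the plain "H", which is a single byte.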

Is there a reliable way in Delphi to determine where an element in a Unicode String starts and ends?

Not directly. You have to parse the string from the beginning, skipping elements as needed until you reach the desired element. Remember that each codepoint may be encoded as either 1 or 2 codeunit elements, and each logical glyph may be encoded using multiple codepoints (and thus multiple codeunit sequences).

I know my terminology using the word element may be off, but I don't think codepoint and character are right either, particularly given that one element may have a codepoint size of 3, but have a length of only one.

1 glyph is composed of 1+ codepoints, and each codepoint is encoded as 1+ codeunits.
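The three levels can be counted separately, which may help keep the terms apart. The following is only a sketch: CountCodePoints and CountElements are invented names, the Character unit's global GetUnicodeCategory() is assumed to be available (newer Delphi versions expose it through System.Character and a Char helper instead), marks outside the BMP are not recognised, and this is not full Unicode grapheme cluster segmentation.

```pascal
program ThreeLevels;
{$APPTYPE CONSOLE}

uses
  SysUtils, Character;

// True for the second codeunit of a surrogate pair.
function IsLowUnit(C: Char): Boolean;
begin
  Result := (C >= #$DC00) and (C <= #$DFFF);
end;

// Hypothetical helper: a high surrogate plus a low surrogate is one codepoint.
function CountCodePoints(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if (S[I] >= #$D800) and (S[I] <= #$DBFF) and
       (I < Length(S)) and IsLowUnit(S[I + 1]) then
      Inc(I, 2)
    else
      Inc(I);
    Inc(Result);
  end;
end;

// Hypothetical helper: every codeunit that is neither a mark nor the
// tail of a surrogate pair is taken to start a new element.
function CountElements(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if not IsLowUnit(S[I]) and not (GetUnicodeCategory(S[I]) in
      [ucCombiningMark, ucEnclosingMark, ucNonSpacingMark]) then
      Inc(Result);
end;

var
  S: string;
begin
  S := 'H'#$0302'a'#$0300#$0332'V'#$0302'e'; // assumed spelling of the question's string
  Writeln(Length(S));          // 8 codeunits
  Writeln(CountCodePoints(S)); // 8 codepoints
  Writeln(CountElements(S));   // 4 elements
end.
```

For a supplementary character such as #$D852#$DF62 the same routines give 2 codeunits, 1 codepoint and 1 element.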

Could someone implement the following function?

function GetElementAtIndex(S: String; StrIdx : Integer): String;

Try something like this:

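uses
  SysUtils, Character;

function MyCharNext(P: PChar): PChar;
begin
  Result := P;
  if (Result <> nil) and (Result^ <> #0) then
  begin
    // step over one codepoint (1 or 2 codeunits)
    Result := StrNextChar(Result);
    // then over any marks attached to it; U+0300, U+0302 and U+0332 are
    // non-spacing marks, which ucCombiningMark alone does not cover
    while GetUnicodeCategory(Result^) in
      [ucCombiningMark, ucEnclosingMark, ucNonSpacingMark] do
      Result := StrNextChar(Result);
  end;
end;

function GetElementAtIndex(S: String; StrIdx : Integer): String;
var
  pStart, pEnd: PChar;
begin
  Result := '';
  // StrIdx is 1-based; anything lower has no element
  if (S = '') or (StrIdx < 1) then Exit;
  pStart := PChar(S);
  // skip the elements that precede the requested one
  while StrIdx > 1 do
  begin
    pStart := MyCharNext(pStart);
    if pStart^ = #0 then Exit; // ran off the end of the string
    Dec(StrIdx);
  end;
  pEnd := MyCharNext(pStart);
  // PChar supports pointer arithmetic natively; pEnd-pStart is a codeunit count
  SetString(Result, pStart, pEnd-pStart);
end;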
