How can I get the last Unicode text element of a string without iterating over the entire string? System.Globalization.StringInfo offers two ways, but I suspect that both of them enumerate the entire string:
[TestMethod]
[TestCategory("Verification")]
public void GetLastTextElement_TextEndsWithSurrogatePair_GetsSurrogatePair()
{
    // Arrange
    const string OsmanyaDigitOne = "\U000104A1";
    const string OsmanyaDigitTwo = "\U000104A2";
    const string Target = "abc" + OsmanyaDigitOne + "de" + OsmanyaDigitTwo;

    // Act
    int length = Target.Length;
    string lastSubstring = Target.Substring(length - 1);
    StringInfo stringInfo = new StringInfo(Target);
    int lengthInTextElements = stringInfo.LengthInTextElements;
    string lastTextElement = stringInfo.SubstringByTextElements(lengthInTextElements - 1);
    string lastTextElementInOneExpression = Target.Substring(StringInfo.ParseCombiningCharacters(Target).Last());

    // Assert
    Assert.AreEqual(9, length, @"Wrong length");
    Assert.AreNotEqual(OsmanyaDigitTwo, lastSubstring, @"Unexpectedly got last text element");
    Assert.AreEqual(7, lengthInTextElements, @"Wrong length in text elements");
    Assert.AreEqual(OsmanyaDigitTwo, lastTextElement, @"Wrong last text element");
    Assert.AreEqual(OsmanyaDigitTwo, lastTextElementInOneExpression, @"Wrong last text element");
}
If you simply mean the last Unicode code point, then it is quite easy:
string unicode = Target.Length >= 2 && char.IsLowSurrogate(Target, Target.Length - 1) && char.IsHighSurrogate(Target, Target.Length - 2)
    ? Target.Substring(Target.Length - 2, 2)  // the string ends with a surrogate pair
    : Target.Substring(Target.Length - 1, 1); // the string ends with a single UTF-16 code unit (throws if Target is empty)
If you mean the last grapheme (that is, the last code point together with any combining marks that follow it, so potentially multiple code points, such as e + ◌̃), it is more complex.
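One possible approach for the grapheme case is to walk backward from the end of the string, skipping combining marks until a base code point is found. The sketch below is only an approximation: the class and method names (LastTextElement, GetLast) are hypothetical, and it treats a text element as a base code point followed by combining marks (Unicode categories Mn, Mc and Me). It does not implement the full extended grapheme cluster rules of UAX #29, so sequences such as emoji joined by a zero-width joiner or regional indicator pairs are not handled.

```csharp
using System;
using System.Globalization;

public static class LastTextElement
{
    // Hypothetical helper, not part of the framework.
    // Only inspects the characters at the end of the string.
    public static string GetLast(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return string.Empty;
        }

        int start = text.Length;
        while (start > 0)
        {
            // Step back one code point: two chars for a valid surrogate pair, otherwise one.
            start -= (start >= 2 && char.IsSurrogatePair(text, start - 2)) ? 2 : 1;

            // Assumption: only these three categories count as "combining".
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(text, start);
            bool isCombining = category == UnicodeCategory.NonSpacingMark
                || category == UnicodeCategory.SpacingCombiningMark
                || category == UnicodeCategory.EnclosingMark;

            if (!isCombining)
            {
                break; // Found the base code point of the last text element.
            }
        }

        return text.Substring(start);
    }
}
```

For the Target string from the question, LastTextElement.GetLast(Target) would return the Osmanya digit two surrogate pair, and for "e\u0303" it would return both chars. The work done is proportional to the length of the last text element, not to the length of the whole string.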