remove 4 byte UTF8 characters

Question

I'd like to remove 4-byte UTF-8 characters that start with \xF0 (the char with the code 0xF0) from a string, and tried

sText = Regex.Replace (sText, "\xF0...", "");

This doesn't work. Using two backslashes did not work either.

The exact input is the content of https://de.wikipedia.org/w/index.php?title=Spezial:Exportieren&action=submit&pages=Unicode. The 4-byte character is the one after the text "[[Violinschlüssel]] ", in hex notation: .. 0x65 0x6c 0x5d 0x5d 0x20 0xf0 0x9d 0x84 0x9e 0x20 .. The expected output is .. 0x65 0x6c 0x5d 0x5d 0x20 0x20 ..

What's wrong?

Answer 1

Such characters will be surrogate pairs in .NET, which uses UTF-16. Each of them takes two UTF-16 code units, that is, two char values.

To just remove them, you can do the following (requires using System.Linq;):

sText = string.Concat(sText.Where(x => !char.IsSurrogate(x)));

(uses an overload of Concat introduced in .NET 4.0 (Visual Studio 2010)).

Late addition: It may give better performance to use:

sText = new string(sText.Where(x => !char.IsSurrogate(x)).ToArray());

even if it looks worse. (Works in .NET 3.5 (Visual Studio 2008).)
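As a quick check of the approach above, here is a minimal sketch run against the bytes quoted in the question. The class and method names (SurrogateFilterDemo, RemoveSurrogates) are made up for illustration. Note that this filter drops every surrogate code unit, so it removes all characters outside the Basic Multilingual Plane, not only those whose UTF-8 form starts with 0xF0.

```csharp
using System;
using System.Linq;

static class SurrogateFilterDemo
{
    // Hypothetical helper: drops every UTF-16 surrogate code unit,
    // which removes all characters above U+FFFF (and any lone surrogates).
    public static string RemoveSurrogates(string text) =>
        new string(text.Where(c => !char.IsSurrogate(c)).ToArray());

    static void Main()
    {
        // "\U0001D11E" is the character from the question (UTF-8: F0 9D 84 9E).
        string input = "el]] \U0001D11E ";
        string output = RemoveSurrogates(input);

        // The removed character occupied two char values.
        Console.WriteLine("{0} -> {1}", input.Length, output.Length); // prints "8 -> 6"
    }
}
```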

Answer 2

You are trying to search for byte values but C# strings are made from char values. The C# language spec at section "2.4.4.4 Character literals" states:

A character literal represents a single character, and usually consists of a character in quotes, as in 'a'.
...
A hexadecimal escape sequence represents a single Unicode character, with the value formed by the hexadecimal number following \x.

Hence the search for "\xF0..." is searching for the character U+00F0, which would be represented in UTF-8 by the bytes C3 B0.
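This can be checked directly. The sketch below (EscapeDemo and Utf8Hex are illustrative names, not part of any library) prints the char value produced by the escape and its UTF-8 encoding.

```csharp
using System;
using System.Text;

static class EscapeDemo
{
    // Hypothetical helper: UTF-8 bytes of a string, formatted as hex.
    public static string Utf8Hex(string text) =>
        BitConverter.ToString(Encoding.UTF8.GetBytes(text));

    static void Main()
    {
        string s = "\xF0";              // a single char, U+00F0, not a byte
        Console.WriteLine((int)s[0]);   // prints 240
        Console.WriteLine(Utf8Hex(s));  // prints C3-B0
    }
}
```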

If you want to find and replace all Unicode characters whose first UTF-8 byte is 0xF0, then I believe you need to search for the character values whose encoding starts with 0xF0.

The character U+10000 is represented as F0 90 80 80 (the preceding code is U+FFFF, which is EF BF BF). The first code that starts with F1 is U+40000, which is F1 80 80 80, and the value before it is U+3FFFF, which is F0 BF BF BF.
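The boundary values above can be reproduced with a short sketch. BoundaryDemo and Utf8Hex are names chosen here for illustration; the encoding itself comes from Encoding.UTF8.

```csharp
using System;
using System.Text;

static class BoundaryDemo
{
    // Hypothetical helper: UTF-8 encoding of one code point, as hex bytes.
    public static string Utf8Hex(int codePoint) =>
        BitConverter.ToString(
            Encoding.UTF8.GetBytes(char.ConvertFromUtf32(codePoint)));

    static void Main()
    {
        foreach (int cp in new[] { 0xFFFF, 0x10000, 0x3FFFF, 0x40000 })
        {
            Console.WriteLine("U+{0:X} = {1}", cp, Utf8Hex(cp));
        }
        // U+FFFF = EF-BF-BF
        // U+10000 = F0-90-80-80
        // U+3FFFF = F0-BF-BF-BF
        // U+40000 = F1-80-80-80
    }
}
```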

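Hence you need to remove characters in the range U+10000 to U+3FFFF. A .NET regular expression works on UTF-16 code units, and its \x escape takes only two hex digits, so the range cannot be written directly as code points. Written as surrogate pairs instead (U+10000 is D800 DC00 and U+3FFFF is D8BF DFFF), this should be possible with a regular expression such as

sText = Regex.Replace (sText, @"[\uD800-\uD8BF][\uDC00-\uDFFF]", "");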
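If you would rather not depend on regex escapes at all, one alternative is to walk the string by code point and drop only the values in U+10000 to U+3FFFF. This is a sketch under that assumption; PlaneFilterDemo and RemoveF0Range are hypothetical names, and unpaired surrogates are deliberately left untouched.

```csharp
using System;
using System.Text;

static class PlaneFilterDemo
{
    // Hypothetical helper: removes code points U+10000..U+3FFFF
    // (UTF-8 lead byte F0) and keeps everything else.
    public static string RemoveF0Range(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            bool isPair = char.IsHighSurrogate(text[i])
                          && i + 1 < text.Length
                          && char.IsLowSurrogate(text[i + 1]);
            if (isPair)
            {
                int cp = char.ConvertToUtf32(text[i], text[i + 1]);
                if (cp > 0x3FFFF)
                {
                    sb.Append(text, i, 2); // lead byte F1..F4: keep the pair
                }
                i++; // step over the low surrogate
            }
            else
            {
                sb.Append(text[i]); // BMP char or lone surrogate: keep
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        string input = "] \U0001D11E (";
        Console.WriteLine(RemoveF0Range(input).Length); // prints 4
    }
}
```

Unlike a filter on char.IsSurrogate, this keeps supplementary characters from U+40000 upward, which matches the question's "starts with 0xF0" condition more literally.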

The relevant characters from the source quoted in the question have been extracted into the code below. The code then tries to understand how the characters are held in strings.

static void Main(string[] args)
{
    string input = "] 𝄞 (";
    Console.Write("Input length  {0} : '{1}'  : ", input.Length, input);
    foreach (char cc in input)
    {
        Console.Write("  {0,2:X02}", (int)cc);
    }
    Console.WriteLine();
}

The output from the program is as below. This supports the surrogate pair explanation given by @Jeppe in his answer.

Input length  6 : '] ?? ('  :   5D  20  D834  DD1E  20  28
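The two values D834 and DD1E in that output follow from the UTF-16 encoding rule for code points above U+FFFF. A small sketch of the arithmetic, using the illustrative names SurrogateMath and ToSurrogates:

```csharp
using System;

static class SurrogateMath
{
    // Hypothetical helper: splits a supplementary code point (U+10000..U+10FFFF)
    // into its high and low surrogate, formatted as hex.
    public static string ToSurrogates(int codePoint)
    {
        int v = codePoint - 0x10000;      // 20-bit offset
        int high = 0xD800 + (v >> 10);    // top 10 bits
        int low = 0xDC00 + (v & 0x3FF);   // bottom 10 bits
        return string.Format("{0:X4} {1:X4}", high, low);
    }

    static void Main()
    {
        // U+1D11E is the character from the question.
        Console.WriteLine(ToSurrogates(0x1D11E)); // prints D834 DD1E
    }
}
```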

2 answers

solution1
5 ACCEPTED 2016-08-02 09:15:56

solution2
2 2016-08-02 08:19:23
