I'm parsing a number of text files that contain 99.9% ascii characters. Numbers, basic punctuation and letters AZ (upper and lower case).
The files also contain names, which occasionally contain characters which are part of the extended ascii character set, for example umlauts Ü and cedillas ç.
I want to only work with standard ascii, so I handle these extended characters by processing any names through a series of simple replace() commands...
myString = myString.Replace("ç", "c");
myString = myString.Replace("Ü", "U");
This works with all the strange characters I want to replace except for Ø (capital O with a forward slash through it). I think this has the decimal equivalent of 157.
If I process the string character-by-character using ToInt32() on each character it claims the decimal equivalent is 65533 - well outside the normal range of extended ascii codes.
Questions
myString.Replace("Ø", "O");
work on this character? Other information - may be pertinent. Opening the file with Notepad shows the character as a "Ø". Comparison with other sources indicate that the data is correct (ie the full string is "Jørgensen" - a valid Danish name). Viewing the character in visual studio shows it as "�". I'm getting exactly the same problem (with this one character) in hundreds of different files. I can happily replace all the other extended characters I encounter without problems. I'm using System.IO.File.ReadAllLines()
to read all the lines into an array of strings for processing.
'Ø'
when it 'knows' about it: Console.WriteLine("Jørgensen".Replace("ø", "o"));
In your case the problem is that you are trying to read the data with the wrong encoding, that's why the string does not contain the character which you are trying to replace. Ø
is part of the extended ASCII set - iso-8859-1 , but File.ReadAllLines tries to detect encoding using BOM chars and, I suspect, falls back to UTF-8
in your case (see Remarks in the documentation).
The same behavior you see in the VS code - it tries to open the file with UTF-8 encoding and shows you �: If you switch the encoding to the correct one - it shows the text correctly:
If you know what encoding is used for your files, just use it explicitly, here is an example to illustrate the difference:
// prints J?rgensen
File.ReadAllLines("data.txt")
.Select(l => l.Replace("Ø", "O"))
.ToList()
.ForEach(Console.WriteLine);
// prints Jorgensen
File.ReadAllLines("data.txt",Encoding.GetEncoding("iso-8859-1"))
.Select(l => l.Replace("Ø", "O"))
.ToList()
.ForEach(Console.WriteLine);
public static string RemoveDiacritics(string s)
{
var normalizedString = s.Normalize(NormalizationForm.FormD);
var stringBuilder = new StringBuilder();
for(var i = 0; i < normalizedString.Length; i++)
{
var c = normalizedString[i];
if(CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
stringBuilder.Append(c);
}
return stringBuilder.ToString();
}
...
// prints Jorgensen
File.ReadAllLines("data.txt", Encoding.GetEncoding("iso-8859-1"))
.Select(RemoveDiacritics)
.ToList()
.ForEach(Console.WriteLine);
I'd strongly recommend reading C# in Depth: Unicode by Jon Skeet and Programming with Unicode by Victor Stinner books to have a much better understanding of what's going on:) Good luck.
PS. My code example is functional, compact but pretty inefficient, if you parse huge files consider using another solution.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.