How to read and store a string in UTF-8 format in C#?

Question

I have a file with URLs, one of which is http://en.wikipedia.org/wiki/São_Paulo . Note the 'ã'. When I read the URLs (in C#) and print them, this one appears as http://en.wikipedia.org/wiki/S?o_Paulo .

I tried reading the URLs as follows:

List<string> urls = System.IO.File.ReadAllLines(wikiURL_FilePath, Encoding.UTF8).ToList();

Note that I passed Encoding.UTF8 as the second argument so that the file is read as UTF-8, but the problem persists. How can I read and store the string correctly?

Answer 1

The data you have shown is simply not UTF-8, despite having a UTF-8 BOM. The UTF-8 encoding of "São" is 53-C3-A3-6F; your file contains 53-E3-6F. Those byte values match the Unicode code points (ã is U+00E3), but they were written to disk one byte per character instead of being encoded as UTF-8. You probably need to fix the code that wrote this file, or else agree on what the encoding actually is. It could be a single-byte code page, but you need to agree on which one, otherwise everything falls apart.
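
To make the byte comparison concrete, here is a minimal sketch that dumps the bytes of "São" under two encodings and then decodes the bytes found in the file as UTF-8. The class and method names (EncodingDemo, Hex) are made up for illustration, and ISO-8859-1 is used only as one example of a single-byte encoding; it is not confirmed to be the encoding of your file.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Hypothetical helper: hex dump of a string's bytes in the given encoding.
    public static string Hex(string s, Encoding enc)
    {
        return BitConverter.ToString(enc.GetBytes(s));
    }

    static void Main()
    {
        // "São", written with an escape so the source file's own encoding does not matter.
        string s = "S\u00E3o";

        Console.WriteLine(Hex(s, Encoding.UTF8));                       // 53-C3-A3-6F
        Console.WriteLine(Hex(s, Encoding.GetEncoding("iso-8859-1")));  // 53-E3-6F

        // The bytes actually present in the file, decoded as UTF-8:
        // E3 announces a multi-byte sequence, but 6F is not a continuation byte,
        // so the decoder substitutes the replacement character U+FFFD.
        byte[] onDisk = { 0x53, 0xE3, 0x6F };
        string misread = Encoding.UTF8.GetString(onDisk);
        Console.WriteLine(misread.Contains("\uFFFD"));                  // True
    }
}
```

A console that cannot display U+FFFD typically shows it as '?', which is consistent with the "S?o" in the question.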

Likely-looking encodings (if we ignore the BOM):

utf-7
windows-1252
windows-1254
iso-8859-1
iso-8859-4
iso-8859-9
iso-8859-15
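
If you settle on one of the single-byte encodings above, reading the file could look roughly like the sketch below. It assumes ISO-8859-1 by default purely as an example, and LegacyUrlReader and ReadLines are hypothetical names. The file is read as raw bytes rather than through File.ReadAllLines because the reader behind that method honours a BOM when it finds one, so the misleading UTF-8 BOM would override the encoding you pass in.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

static class LegacyUrlReader
{
    // Hypothetical helper: decode a file with an explicitly chosen single-byte
    // encoding, skipping a misleading UTF-8 BOM (EF BB BF) if one is present.
    public static List<string> ReadLines(string path, string encodingName = "iso-8859-1")
    {
        byte[] bytes = File.ReadAllBytes(path);

        // Skip the BOM so it is not decoded as three junk characters.
        bool hasBom = bytes.Length >= 3
            && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
        int offset = hasBom ? 3 : 0;

        string text = Encoding.GetEncoding(encodingName)
            .GetString(bytes, offset, bytes.Length - offset);

        // Split into lines the same way a text reader would.
        var lines = new List<string>();
        using (var reader = new StringReader(text))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }
        return lines;
    }
}
```

Usage would be something like List<string> urls = LegacyUrlReader.ReadLines(wikiURL_FilePath); with the variable name taken from the question. On .NET Core and later, names such as windows-1252 are only available after registering CodePagesEncodingProvider.Instance via Encoding.RegisterProvider, whereas iso-8859-1 is built in.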

solution1
3 ACCEPTED 2015-07-20 09:29:45