简体   繁体   中英

Read a file with unicode characters

I have an asp.net c# page and am trying to read a file that has the following charater ' and convert it to '. (From slanted apostrophe to apostrophe).

FileInfo fileinfo = new FileInfo(FileLocation);
string content = File.ReadAllText(fileinfo.FullName);

//strip out bad characters
content = content.Replace("’", "'");

This doesn't work and it changes the slanted apostrophes into ? marks.

I suspect that the problem is not with the replacement, but rather with the reading of the file itself. When I tried this the nieve way (using Word and copy-paste) I ended up with the same results as you, however examining content showed that the .Net framework believe that the character was Unicode character 65533 , ie the "WTF?" character before the string replacement. You can check this yourself by examining the relevant character in the Visual Studio debugger, where it should show the character code:

content[0]; // 65533 '�'

The reason why the replace isn't working is simple - content doesn't contain the string you gave it:

content.IndexOf("’"); // -1

As for why the file reading isn't working properly - you are probably using the wrong encoding when reading the file. (If no encoding is specified then the .Net framework will try to determine the correct encoding for you, however there is no 100% reliable way to do this and so often it can get it wrong). The exact encoding you need depends on the file itself, however in my case the encoding being used was Extended ASCII , and so to read the file I just needed to specify the correct encoding:

string content = File.ReadAllText(fileinfo.FullName, Encoding.GetEncoding("iso-8859-1"));

(See this question ).

You also need to make sure that you specify the correct character in your replacement string - when using "odd" characters in code you may find it more reliable to specify the character by its character code, rather than as a string literal (which may cause problems if the encoding of the source file changes), for example the following worked for me:

content = content.Replace("\u0092", "'");
// This should replace smart single quotes with a straight single quote

Regex.Replace(content, @"(\u2018|\u2019)", "'");

//However the better approach seems to be to read the page with the proper encoding and leave the quotes alone
var sreader= new StreamReader(fileInfo.Create(), Encoding.GetEncoding(1252));

My bet is the file is encoded in Windows-1252 . This is almost the same as ISO 8859-1. The difference is Windows-1252 uses "displayable characters rather than control characters in the 0x80 to 0x9F range". (Which is where the slanted apostrophe is located. ie 0x92)

//Specify Windows-1252 here
string content = File.ReadAllText(fileinfo.FullName, Encoding.GetEncoding(1252));
//Your replace code will then work as is
content = content.Replace("’", "'");

If you use String (capitalized) and not string, it should be able to handle any Unicode you throw at it. Try that first and see if that works.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM