Read a file with Unicode characters
I have an ASP.NET C# page and am trying to read a file that contains the character ’ and convert it to '. (From slanted apostrophe to straight apostrophe.)
FileInfo fileinfo = new FileInfo(FileLocation);
string content = File.ReadAllText(fileinfo.FullName);
//strip out bad characters
content = content.Replace("’", "'");
This doesn't work: it changes the slanted apostrophes into ? marks.
I suspect that the problem is not with the replacement, but rather with the reading of the file itself. When I tried this the naive way (using Word and copy-paste) I ended up with the same results as you. However, examining content showed that the .NET Framework believed the character was Unicode character 65533 (U+FFFD, the replacement character), i.e. the "WTF?" character, before the string replacement ever ran. You can check this yourself by examining the relevant character in the Visual Studio debugger, where it should show the character code:
content[0]; // 65533 '�'
The reason why the replace isn't working is simple: content doesn't contain the string you gave it:
content.IndexOf("’"); // -1
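To see this effect without a file or a debugger, here is a minimal sketch. It assumes the file was saved as Windows-1252 (where the slanted apostrophe is the single byte 0x92) and that it ends up being decoded as UTF-8; the byte array and variable names are made up for illustration. It is written as top-level statements (C# 9 or later); in an older project the body can be pasted into any method.

```csharp
using System;
using System.Text;

// "It’s" as it would be stored in a Windows-1252 file (assumed encoding):
// 'I' = 0x49, 't' = 0x74, slanted apostrophe = 0x92, 's' = 0x73.
byte[] raw = { 0x49, 0x74, 0x92, 0x73 };

// A lone 0x92 byte is not valid UTF-8, so the decoder substitutes U+FFFD.
string decoded = Encoding.UTF8.GetString(raw);

Console.WriteLine((int)decoded[2]);            // 65533
Console.WriteLine(decoded.IndexOf('\u2019'));  // -1, the apostrophe is already gone
```

Once the character has been turned into U+FFFD, no replacement on the string can recover it; the fix has to happen at decode time.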
As for why the file reading isn't working properly: you are probably using the wrong encoding when reading the file. (If no encoding is specified, the .NET Framework tries to determine the encoding for you, going by a byte order mark if there is one and otherwise assuming UTF-8. There is no 100% reliable way to do this, so it can easily get it wrong.) The exact encoding you need depends on the file itself. In my case the encoding being used was Extended ASCII, so to read the file I just needed to specify the correct encoding:
string content = File.ReadAllText(fileinfo.FullName, Encoding.GetEncoding("iso-8859-1"));
(See this question.)
You also need to make sure that you specify the correct character in your replacement string. When using "odd" characters in code you may find it more reliable to specify the character by its character code rather than as a string literal (which may cause problems if the encoding of the source file changes). For example, the following worked for me:
content = content.Replace("\u0092", "'");
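The following is a rough, self-contained sketch of why the "\u0092" escape matches after an ISO 8859-1 read. It assumes the file stores the apostrophe as the byte 0x92; the temporary file and its contents are invented purely for the demonstration. ISO 8859-1 maps each byte straight to the code point with the same value, so 0x92 becomes U+0092 (a control character) rather than U+2019.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical sample file: "It’s" with the apostrophe stored as byte 0x92.
string path = Path.GetTempFileName();
File.WriteAllBytes(path, new byte[] { 0x49, 0x74, 0x92, 0x73 });

// ISO 8859-1 decodes byte 0x92 to the code point U+0092.
string content = File.ReadAllText(path, Encoding.GetEncoding("iso-8859-1"));
Console.WriteLine((int)content[2]);   // 146, i.e. 0x92

// The escape targets exactly that code point.
content = content.Replace("\u0092", "'");
Console.WriteLine(content);           // It's

File.Delete(path);
```

Note that a replacement written with the literal ’ (U+2019) would not match here, because that code point never appears in a string decoded this way.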
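// This should replace smart single quotes with a straight single quote.
// Regex.Replace returns a new string, so the result has to be assigned.
content = Regex.Replace(content, @"[\u2018\u2019]", "'");
// However, the better approach seems to be to read the file with the proper encoding and leave the quotes alone.
// OpenRead opens the existing file for reading; Create would overwrite it with an empty file.
var sreader = new StreamReader(fileinfo.OpenRead(), Encoding.GetEncoding(1252));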
My bet is the file is encoded in Windows-1252. This is almost the same as ISO 8859-1. The difference is that Windows-1252 uses "displayable characters rather than control characters in the 0x80 to 0x9F range", which is where the slanted apostrophe is located (at 0x92).
//Specify Windows-1252 here
string content = File.ReadAllText(fileinfo.FullName, Encoding.GetEncoding(1252));
//Your replace code will then work as is
content = content.Replace("’", "'");
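As a quick check of the Windows-1252 explanation, here is a small sketch using an in-memory byte array instead of a real file (the bytes and variable names are illustrative only). One assumption to flag: on .NET Core and .NET 5+ code page 1252 is only available after registering CodePagesEncodingProvider, whereas on the classic .NET Framework the registration line is not needed and can be dropped.

```csharp
using System;
using System.Text;

// Assumption: running on .NET Core / .NET 5+, where code page 1252 must be
// registered first. On the classic .NET Framework it is built in.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// "It’s" with the slanted apostrophe stored as the Windows-1252 byte 0x92.
byte[] raw = { 0x49, 0x74, 0x92, 0x73 };

// Windows-1252 maps 0x92 to U+2019, the right single quotation mark.
string decoded = Encoding.GetEncoding(1252).GetString(raw);
Console.WriteLine((int)decoded[2]);   // 8217, i.e. 0x2019

// With the character decoded correctly, the replacement finds it.
string cleaned = decoded.Replace("\u2019", "'");
Console.WriteLine(cleaned);           // It's
```

The escape "\u2019" is used here instead of the literal character only so that the sketch does not depend on the encoding of the source file itself.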
If you use String (capitalized) and not string, it should be able to handle any Unicode you throw at it. Try that first and see if that works. (Note that in C# string is an alias for System.String, so the two behave identically; what matters is the encoding used to read the file.)