Read a file with Unicode characters

I have an ASP.NET C# page and am trying to read a file that contains the character ’ and convert it to ' (from a slanted apostrophe to a straight apostrophe).

FileInfo fileinfo = new FileInfo(FileLocation);
string content = File.ReadAllText(fileinfo.FullName);

//strip out bad characters
content = content.Replace("’", "'");

This doesn't work: it changes the slanted apostrophes into ? marks.

I suspect that the problem is not with the replacement, but rather with the reading of the file itself. When I tried this the naive way (using Word and copy-paste) I ended up with the same results as you. However, examining content showed that the .NET Framework believed the character was Unicode character 65533 (U+FFFD, i.e. the "WTF?" replacement character) before the string replacement even ran. You can check this yourself by examining the relevant character in the Visual Studio debugger, where it should show the character code:

content[0]; // 65533 '�'

The reason the replace isn't working is simple. content doesn't contain the string you gave it:

content.IndexOf("’"); // -1
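To reproduce this without a file, here is a minimal sketch. It assumes the file stores the apostrophe as the single byte 0x92 and that the bytes are decoded as UTF-8, which is what File.ReadAllText(path) falls back to when it finds no byte order mark. The class name, method name, and sample bytes are made up for illustration.

```csharp
using System;
using System.Text;

public static class ReplacementCharDemo
{
    // Hypothetical sample: "it’s" as four single bytes, with the apostrophe stored as 0x92.
    public static readonly byte[] Sample = { 0x69, 0x74, 0x92, 0x73 };

    // Decodes raw bytes as UTF-8. A lone 0x92 is not a valid UTF-8 sequence,
    // so the decoder substitutes U+FFFD (65533) for it.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Returns the numeric code of the character at the given index,
    // which is the value the debugger shows next to the character.
    public static int CodeAt(string text, int index)
    {
        return (int)text[index];
    }
}
```

Calling ReplacementCharDemo.CodeAt(ReplacementCharDemo.DecodeAsUtf8(ReplacementCharDemo.Sample), 2) should give 65533, and searching the decoded string for "’" finds nothing, because the original character was already lost during decoding.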

As for why the file reading isn't working properly: you are probably using the wrong encoding when reading the file. (If no encoding is specified, File.ReadAllText looks for a byte order mark and otherwise assumes UTF-8. There is no 100% reliable way to detect an encoding, so it can easily get it wrong.) The exact encoding you need depends on the file itself. In my case the file used an "extended ASCII" encoding, so to read it I just needed to specify the correct encoding:

string content = File.ReadAllText(fileinfo.FullName, Encoding.GetEncoding("iso-8859-1"));

(See this question.)

You also need to make sure that you specify the correct character in your replacement string. When using "odd" characters in code, you may find it more reliable to specify the character by its character code rather than as a literal character (which may cause problems if the encoding of the source file changes). For example, the following worked for me:

content = content.Replace("\u0092", "'");
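The reason "\u0092" is the right thing to search for after an ISO-8859-1 read can be sketched as follows. ISO-8859-1 maps every byte to the code point with the same value, so the byte 0x92 becomes the control character U+0092 rather than a real apostrophe. The class and method names are hypothetical, and the sample bytes assume the apostrophe is stored as 0x92.

```csharp
using System;
using System.Text;

public static class Latin1Demo
{
    // Decodes bytes as ISO-8859-1: each byte becomes the code point
    // with the same numeric value, so 0x92 becomes U+0092.
    public static string DecodeAsLatin1(byte[] bytes)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }

    // Replaces the control character U+0092 with a straight apostrophe.
    // The escape sequence keeps the source file pure ASCII.
    public static string FixApostrophes(string text)
    {
        return text.Replace("\u0092", "'");
    }
}
```

With the bytes 0x69, 0x74, 0x92, 0x73 as input, decoding and then calling FixApostrophes should produce it's.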
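// This should replace smart single quotes with a straight single quote.
// Regex.Replace returns a new string, so assign the result back.
content = Regex.Replace(content, @"[\u2018\u2019]", "'");

// However, the better approach seems to be to read the file with the proper encoding and leave the quotes alone.
// Use OpenRead() here: FileInfo.Create() would create a new file and overwrite the existing one.
var sreader = new StreamReader(fileinfo.OpenRead(), Encoding.GetEncoding(1252));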
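As a self-contained sketch of the regex approach: .NET strings are immutable, so Regex.Replace returns a new string, and that return value is what has to be kept. The class and method names below are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SmartQuotes
{
    // Replaces left (U+2018) and right (U+2019) single smart quotes
    // with a straight apostrophe and returns the new string.
    public static string StraightenSingleQuotes(string text)
    {
        return Regex.Replace(text, "[\u2018\u2019]", "'");
    }
}
```

This only helps if the string already contains U+2018 or U+2019, that is, if the file was decoded with the right encoding in the first place.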

My bet is the file is encoded in Windows-1252. This is almost the same as ISO 8859-1. The difference is Windows-1252 uses "displayable characters rather than control characters in the 0x80 to 0x9F range" (which is where the slanted apostrophe is located, i.e. 0x92).

//Specify Windows-1252 here
string content = File.ReadAllText(fileinfo.FullName, Encoding.GetEncoding(1252));
//Your replace code will then work as is
content = content.Replace("’", "'");
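A small sketch of the difference, assuming the apostrophe is stored as the byte 0x92: decoded as Windows-1252 it becomes U+2019 (’), so a replace on the literal character can find it. The class and method names are hypothetical. The provider registration is an assumption about the runtime. On .NET Core and .NET 5+, code page 1252 is only available once CodePagesEncodingProvider is registered, while the classic .NET Framework ships the code page built in and does not need that line.

```csharp
using System;
using System.Text;

public static class Windows1252Demo
{
    static Windows1252Demo()
    {
        // Needed on .NET Core / .NET 5+ so that GetEncoding(1252) succeeds.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Decodes bytes as Windows-1252, where 0x92 maps to U+2019 (’).
    public static string DecodeAsWindows1252(byte[] bytes)
    {
        return Encoding.GetEncoding(1252).GetString(bytes);
    }

    // Replaces the right single quotation mark with a straight apostrophe.
    public static string FixApostrophes(string text)
    {
        return text.Replace("\u2019", "'");
    }
}
```

Decoding the bytes 0x69, 0x74, 0x92, 0x73 this way and then calling FixApostrophes should produce it's.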

Switching from string to String (capitalized) won't help here: in C#, string is just an alias for System.String, and both can hold any Unicode you throw at them. The problem is the encoding used to read the file, not the string type.
