
Invalid characters in File.ReadAllText

I'm calling File.ReadAllText() in a program designed to format some files that I have.

Some of these files contain the ® (174) symbol. However, when the text is read, the returned string contains � (65533) symbols where the ® (174) should be.

What would cause this and how can I fix it?

This is likely due to a mismatch in the Encoding. Use the ReadAllText overload that lets you specify the proper Encoding to use when reading the file.

The default overload will assume UTF-8 unless it can detect UTF-32. Any other encoding will come through incorrectly.

Most likely the file uses a different encoding than the default. If you know it, you can specify it using the File.ReadAllText(String, Encoding) overload.

Code sample:

string readText = File.ReadAllText(path, Encoding.Default);  // <-- change the encoding to whatever the encoding really is

If you DON'T know the encoding, see this previous SO question: How to use ReadAllText when file encoding unknown
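When the encoding is unknown, one common approach is to try a strict UTF-8 decode first and fall back to a single-byte encoding only if that fails. The following is a minimal sketch of that idea, not a full detection routine. ReadTextWithFallback is a hypothetical helper name, the choice of ISO-8859-1 as the fallback is an assumption, and the code assumes a C# version that supports top-level statements.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: try strict UTF-8 first, then fall back to a caller-supplied encoding.
static string ReadTextWithFallback(string path, Encoding fallback)
{
    // throwOnInvalidBytes: true makes the decoder throw instead of silently emitting U+FFFD.
    var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
    try
    {
        return File.ReadAllText(path, strictUtf8);
    }
    catch (DecoderFallbackException)
    {
        // The bytes are not valid UTF-8, so decode them with the fallback encoding instead.
        return File.ReadAllText(path, fallback);
    }
}

// Usage: write "A®" as ISO-8859-1 bytes to a temporary file, then read it back.
string path = Path.GetTempFileName();
File.WriteAllBytes(path, new byte[] { 0x41, 0xAE });
string text = ReadTextWithFallback(path, Encoding.GetEncoding(28591)); // assumed fallback: ISO-8859-1
Console.WriteLine((int)text[1]); // 174
File.Delete(path);
```

A successful fallback decode only shows that the bytes can be read as that encoding, not that the encoding is the right one, so the fallback should still be chosen from what you know about where the files came from.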

You need to specify the encoding when you call File.ReadAllText, unless the file is actually in UTF-8, which it sounds like it's not. (Basically the one-parameter overload is equivalent to passing in UTF-8 as the second argument. It will also detect UTF-32 with an appropriate byte-order mark, I believe.)

The first thing is to work out which encoding the file is in (e.g. ISO-8859-1, but you need to check this) and then pass that as the second argument.

For example:

Encoding isoLatin1 = Encoding.GetEncoding(28591);
string text = File.ReadAllText(path, isoLatin1);

It's always important to know which encoding binary data is using before you try to read it as text. That's true for files, network streams, anything.

The character you are reading is the Replacement character (U+FFFD):

"used to replace an incoming character whose value is unknown or unrepresentable in Unicode; compare the use of U+001A as a control character to indicate the substitute function"

http://www.fileformat.info/info/unicode/char/fffd/index.htm

You are getting this because the actual encoding of the file does not match the encoding your program expects.

By default ReadAllText expects UTF-8. It is encountering a byte sequence that does not represent a valid UTF-8 character, so it replaces that sequence with the Replacement character.
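The substitution can be reproduced without any file at all. The sketch below assumes the ® in the question was stored as the single byte 0xAE (174), as it would be in ISO-8859-1 or Windows-1252. The actual file encoding is not stated in the question, so treat that as an assumption. The code also assumes a C# version that supports top-level statements.

```csharp
using System;
using System.Text;

// Assumption: the file stores ® as the single byte 0xAE (174).
byte[] raw = { 0xAE };

// On its own, 0xAE is a UTF-8 continuation byte with no lead byte,
// so the UTF-8 decoder substitutes U+FFFD for it.
string asUtf8 = Encoding.UTF8.GetString(raw);

// Decoded as ISO-8859-1 (code page 28591), the same byte maps to U+00AE.
string asLatin1 = Encoding.GetEncoding(28591).GetString(raw);

Console.WriteLine((int)asUtf8[0]);   // 65533
Console.WriteLine((int)asLatin1[0]); // 174
```

The two printed values match the numbers in the question: 65533 is what a UTF-8 read produces, and 174 is what the file was meant to contain.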
