
Best way to read text file into byte array in selected encoding?

Right now I use something like this:

Encoding.UTF8.GetBytes(File.ReadAllText(filename))

Any suggestions on how to do this better?

And what encoding does the File.ReadAllBytes(filename) method use?

P.S. I need UTF-8 byte arrays to store text files in a database.

Best way to read file into byte array in selected encoding?

Character encoding is about storing text in binary form, as a specific sequence of bytes for each character. Another way of thinking about it is that the encoding is what gives meaning to the bytes. Without the context that the bytes represent text, they are just bytes.

Files are just bytes too, and they can be interpreted however you want your application to interpret them.

When you decode bytes, you give meaning to those bytes according to the encoding used. For text encodings, you start with bytes and end up with characters.
You can't "decode" bytes from a file into a byte array. That doesn't give meaning to the bytes or produce any characters.

You can, however, decode bytes into strings using a specific encoding:

string allLinesFromFileAsAuto = File.ReadAllText(filename);
string allLinesFromFileAsUTF8 = File.ReadAllText(filename, Encoding.UTF8);
string allLinesFromFileAsASCII = File.ReadAllText(filename, Encoding.ASCII);

All three of these calls convert bytes from the same file into strings, but the resulting strings can differ depending on the encoding you use.
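To make the point concrete, here is a minimal sketch that decodes the same two bytes with two different encodings. The byte values and the choice of ISO-8859-1 as the second encoding are assumptions picked for illustration; they do not come from the question.

```csharp
using System;
using System.Text;

class DecodeDemo
{
    static void Main()
    {
        // 0xC3 0xA9 is the UTF-8 encoding of the single character U+00E9 (é).
        byte[] bytes = { 0xC3, 0xA9 };

        // Decoded as UTF-8, the two bytes form one character.
        string asUtf8 = Encoding.UTF8.GetString(bytes);

        // Decoded as ISO-8859-1, each byte becomes its own character (U+00C3, U+00A9).
        string asLatin1 = Encoding.GetEncoding("iso-8859-1").GetString(bytes);

        Console.WriteLine(asUtf8.Length);     // 1
        Console.WriteLine(asLatin1.Length);   // 2
        Console.WriteLine(asUtf8 == asLatin1); // False
    }
}
```

The bytes never change; only the interpretation does.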

And what encoding does the File.ReadAllBytes(filename) method use?

File.ReadAllBytes(filename) does not use any encoding. Files are just bytes. This method pulls all of a file's bytes into a byte array. If you want text, you still have to decode those bytes into a string after getting the byte array, and that only makes sense for plain-text files.
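As a rough sketch of those two separate steps (reading raw bytes, then decoding them), assuming a temporary sample file written as UTF-8 without a byte order mark:

```csharp
using System;
using System.IO;
using System.Text;

class ReadBytesDemo
{
    static void Main()
    {
        // Hypothetical sample file: "héllo" stored as UTF-8 without a BOM.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "h\u00E9llo", new UTF8Encoding(false));

        // Step 1: raw bytes, no encoding involved.
        byte[] raw = File.ReadAllBytes(path);

        // Step 2: decoding, where you choose the encoding.
        string text = Encoding.UTF8.GetString(raw);

        Console.WriteLine(raw.Length);            // 6 (é takes two bytes in UTF-8)
        Console.WriteLine(text == "h\u00E9llo");  // True

        File.Delete(path);
    }
}
```

If the file on disk is already UTF-8, the array from step 1 is already the UTF-8 byte array, with no decoding and re-encoding needed.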

I need UTF-8 byte arrays to store files in a database.

Is this because your database uses UTF-8 encoding?
The encoding of a database defines how text is stored (as binary). Binary data can be stored as-is, byte-for-byte, as "blobs" in most databases, regardless of the encoding.

ReadAllText will try to infer the encoding of the file (it looks for a byte order mark and otherwise assumes UTF-8) and decode it into a .NET string. Your first example will then convert that string to UTF-8 bytes no matter what the source encoding was.
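If you already know the source encoding, one possible alternative is to read the raw bytes and transcode them with Encoding.Convert. This is only a sketch: the helper name ReadAsUtf8Bytes and the ISO-8859-1 sample file are assumptions made for illustration.

```csharp
using System;
using System.IO;
using System.Text;

class TranscodeDemo
{
    // Hypothetical helper: reads a file known to be in sourceEncoding
    // and returns its text re-encoded as UTF-8 bytes.
    static byte[] ReadAsUtf8Bytes(string path, Encoding sourceEncoding)
    {
        byte[] raw = File.ReadAllBytes(path);
        return Encoding.Convert(sourceEncoding, Encoding.UTF8, raw);
    }

    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // Sample file: "café" stored as ISO-8859-1 (4 bytes).
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, latin1.GetBytes("caf\u00E9"));

        byte[] utf8 = ReadAsUtf8Bytes(path, latin1);

        Console.WriteLine(utf8.Length);                                   // 5
        Console.WriteLine(Encoding.UTF8.GetString(utf8) == "caf\u00E9");  // True

        File.Delete(path);
    }
}
```

Unlike ReadAllText, this does no detection at all, so it is only correct when the source encoding you pass in is the right one.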

Depending on the size of the files, loading everything into memory twice (once as a string, once as a byte array) could be costly. You can instead read the source file in chunks and convert it as you go.
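A chunked conversion could look roughly like the following sketch, which streams characters from a StreamReader into a StreamWriter. The helper name CopyAsUtf8, the 4096 buffer size, and the in-memory streams standing in for a file and a database stream are all assumptions for the example.

```csharp
using System;
using System.IO;
using System.Text;

class ChunkedTranscode
{
    // Hypothetical helper: decodes input using sourceEncoding and writes
    // the text to output as UTF-8 (no BOM), one buffer at a time.
    static void CopyAsUtf8(Stream input, Encoding sourceEncoding, Stream output)
    {
        // leaveOpen: true, so the caller keeps ownership of both streams.
        using (var reader = new StreamReader(input, sourceEncoding, true, 4096, true))
        using (var writer = new StreamWriter(output, new UTF8Encoding(false), 4096, true))
        {
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                writer.Write(buffer, 0, read);
            }
        } // disposing the writer flushes any remaining bytes to output
    }

    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        using (var input = new MemoryStream(latin1.GetBytes("caf\u00E9")))
        using (var output = new MemoryStream())
        {
            CopyAsUtf8(input, latin1, output);
            Console.WriteLine(output.Length); // 5
        }
    }
}
```

In real use, input would typically be a FileStream and output whatever stream your storage layer accepts, so only one buffer of text is held in memory at a time.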

ReadAllBytes reads the raw file as a series of bytes; there is no encoding or decoding involved.

If you are storing non-text files in the database, you should not encode the file as UTF-8.

