How do I convert the encoding of a large file (>1 GB in size) to Windows-1252 without an out-of-memory exception?
Consider:
public static void ConvertFileToUnicode1252(string filePath, Encoding srcEncoding)
{
    try
    {
        StreamReader fileStream = new StreamReader(filePath);
        Encoding targetEncoding = Encoding.GetEncoding(1252);
        string fileContent = fileStream.ReadToEnd();
        fileStream.Close();

        // Saving file as ANSI 1252
        Byte[] srcBytes = srcEncoding.GetBytes(fileContent);
        Byte[] ansiBytes = Encoding.Convert(srcEncoding, targetEncoding, srcBytes);
        string ansiContent = targetEncoding.GetString(ansiBytes);

        // Now writes contents to file again
        StreamWriter ansiWriter = new StreamWriter(filePath, false);
        ansiWriter.Write(ansiContent);
        ansiWriter.Close();
        // TODO -- log success details
    }
    catch (Exception e)
    {
        // TODO -- log failure details
        throw e;
    }
}
The above piece of code throws an out-of-memory exception for large files and only works for small-sized files.
I think the most elegant solution is still to use a StreamReader and a StreamWriter, but to read blocks of characters instead of everything at once or line by line. It doesn't arbitrarily assume the file consists of lines of manageable length, and it also doesn't break with multi-byte character encodings.
public static void ConvertFileEncoding(string srcFile, Encoding srcEncoding, string destFile, Encoding destEncoding)
{
    using (var reader = new StreamReader(srcFile, srcEncoding))
    using (var writer = new StreamWriter(destFile, false, destEncoding))
    {
        char[] buf = new char[4096];
        while (true)
        {
            int count = reader.Read(buf, 0, buf.Length);
            if (count == 0)
                break;
            writer.Write(buf, 0, count);
        }
    }
}
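For example, here is a complete, compilable sketch of calling it to produce a Windows-1252 file from a UTF-8 source. The file names are hypothetical, and note one assumption: on .NET Core / .NET 5+, code page 1252 is only available after registering the code-page provider from the System.Text.Encoding.CodePages package (on .NET Framework, Encoding.GetEncoding(1252) works directly).

```csharp
using System;
using System.IO;
using System.Text;

class EncodingConverter
{
    public static void ConvertFileEncoding(string srcFile, Encoding srcEncoding,
                                           string destFile, Encoding destEncoding)
    {
        using (var reader = new StreamReader(srcFile, srcEncoding))
        using (var writer = new StreamWriter(destFile, false, destEncoding))
        {
            char[] buf = new char[4096];
            while (true)
            {
                int count = reader.Read(buf, 0, buf.Length);
                if (count == 0)
                    break;
                writer.Write(buf, 0, count);
            }
        }
    }

    static void Main()
    {
        // Needed on .NET Core / .NET 5+ only
        // (System.Text.Encoding.CodePages NuGet package).
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Create a small UTF-8 sample so the example is self-contained;
        // "input.txt" / "output.txt" are hypothetical names.
        File.WriteAllText("input.txt", "héllo wörld", new UTF8Encoding(false));

        ConvertFileEncoding("input.txt", Encoding.UTF8,
                            "output.txt", Encoding.GetEncoding(1252));
    }
}
```

Because it streams 4096 characters at a time, memory usage stays flat regardless of file size.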
(I wish StreamReader had a CopyTo method like Stream does; if it did, this would essentially be a one-liner!)
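Such a CopyTo is easy to supply yourself as an extension method. This is not part of the framework; it's a small sketch mirroring Stream.CopyTo for TextReader/TextWriter:

```csharp
using System.IO;

static class TextReaderExtensions
{
    // Hypothetical CopyTo for TextReader, mirroring Stream.CopyTo:
    // pumps characters from reader to writer in fixed-size blocks.
    public static void CopyTo(this TextReader reader, TextWriter writer, int bufferSize = 4096)
    {
        char[] buf = new char[bufferSize];
        int count;
        while ((count = reader.Read(buf, 0, buf.Length)) > 0)
            writer.Write(buf, 0, count);
    }
}
```

With that in scope, the loop in ConvertFileEncoding collapses to reader.CopyTo(writer);.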
Don't ReadToEnd; read it line by line or X characters at a time. If you read to the end, you put your whole file into the buffer at once.

Try this:
using (FileStream fileStream = new FileStream(filePath, FileMode.Open))
{
    int size = 4096;
    Encoding targetEncoding = Encoding.GetEncoding(1252);
    byte[] byteData = new byte[size];
    using (FileStream outputStream = new FileStream(outputFilepath, FileMode.Create))
    {
        int byteCounter = 0;
        do
        {
            byteCounter = fileStream.Read(byteData, 0, size);
            if (byteCounter > 0)
            {
                // Convert only the bytes actually read, not the whole buffer
                byte[] converted = Encoding.Convert(srcEncoding, targetEncoding,
                                                    byteData, 0, byteCounter);
                outputStream.Write(converted, 0, converted.Length);
            }
        }
        while (byteCounter > 0);
    }
}
It might have some syntax errors as I've done it from memory, but this is how I work with large files: read in a chunk at a time, do some processing, and save the chunk back. It's really the only way of doing it (streaming) without relying on the massive I/O overhead of reading everything, and the huge RAM consumption of storing it all, converting it all in memory, and then saving it all back.

You can always adjust the buffer size.
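One caveat with converting raw byte chunks (as opposed to the char-based StreamReader approach above): with a multi-byte source encoding, a fixed-size byte buffer can cut a character in half at a chunk boundary, and each half then decodes to the replacement character. A minimal sketch of that failure mode, assuming a UTF-8 source:

```csharp
using System;
using System.Text;

class ChunkSplitDemo
{
    static void Main()
    {
        // "é" is two bytes in UTF-8: 0xC3 0xA9.
        byte[] utf8 = Encoding.UTF8.GetBytes("é");

        // Decoding each byte as its own "chunk" — as a naive fixed-size
        // byte loop can do at a buffer boundary — corrupts the character:
        string first = Encoding.UTF8.GetString(utf8, 0, 1);
        string second = Encoding.UTF8.GetString(utf8, 1, 1);

        Console.WriteLine(first + second == "é"); // False: both halves decode to U+FFFD
    }
}
```

The char-based version never hits this, because StreamReader's decoder carries partial byte sequences across reads.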
If you want your old method to work without throwing an OutOfMemoryException, you need to tell the garbage collector to allow very large objects.

In App.config, under <runtime>, add the following line (you shouldn't need it with my code, but it's worth knowing):
<gcAllowVeryLargeObjects enabled="true" />
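In context, the element sits under the <runtime> node of the application's App.config (a sketch):

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <runtime>
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>
```

Note that this setting (available since .NET Framework 4.5) lifts the 2 GB object-size limit for arrays in 64-bit processes only, so it's a workaround rather than a fix; the streaming approaches above avoid the large allocations entirely.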