Combine multiple files into single file
Code:
static void MultipleFilesToSingleFile(string dirPath, string filePattern, string destFile)
{
    string[] fileAry = Directory.GetFiles(dirPath, filePattern);
    Console.WriteLine("Total File Count : " + fileAry.Length);
    using (TextWriter tw = new StreamWriter(destFile, true))
    {
        foreach (string filePath in fileAry)
        {
            using (TextReader tr = new StreamReader(filePath))
            {
                tw.WriteLine(tr.ReadToEnd());
                tr.Close();
                tr.Dispose();
            }
            Console.WriteLine("File Processed : " + filePath);
        }
        tw.Close();
        tw.Dispose();
    }
}
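I need to optimize this because it is extremely slow: it takes 3 minutes for 45 XML files with an average size of 40 to 50 MB.
Please note: 45 files averaging 45 MB is just one example. It can be n files of size m, where n is in the thousands and m averages around 128 KB. In short, it can vary.
Can you offer any views on optimization?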
Why not just use the Stream.CopyTo(Stream destination) method?
private static void CombineMultipleFilesIntoSingleFile(string inputDirectoryPath, string inputFileNamePattern, string outputFilePath)
{
    string[] inputFilePaths = Directory.GetFiles(inputDirectoryPath, inputFileNamePattern);
    Console.WriteLine("Number of files: {0}.", inputFilePaths.Length);
    using (var outputStream = File.Create(outputFilePath))
    {
        foreach (var inputFilePath in inputFilePaths)
        {
            using (var inputStream = File.OpenRead(inputFilePath))
            {
                // Buffer size can be passed as the second argument.
                inputStream.CopyTo(outputStream);
            }
            Console.WriteLine("The file {0} has been processed.", inputFilePath);
        }
    }
}
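Please note that the method above is overloaded.
There are two method overloads:
CopyTo(Stream destination)
CopyTo(Stream destination, int bufferSize)
The second overload lets you tune the buffer size through the bufferSize parameter.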
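As a minimal sketch of the second overload, the snippet below passes an explicit buffer size to CopyTo. The class name FileCombiner, the method name CombineWithBuffer, and the 128 KB default are illustrative assumptions, not part of the answer above; a suitable size depends on the disks involved and is worth measuring.

```csharp
using System;
using System.IO;

public static class FileCombiner
{
    // Hypothetical helper: concatenates the given files into outputFilePath, byte for byte.
    // bufferSize is the size of the intermediate copy buffer (128 KB is an assumed starting point).
    public static void CombineWithBuffer(string[] inputFilePaths, string outputFilePath, int bufferSize = 128 * 1024)
    {
        // File.Create truncates an existing output file before writing.
        using (var outputStream = File.Create(outputFilePath))
        {
            foreach (var inputFilePath in inputFilePaths)
            {
                using (var inputStream = File.OpenRead(inputFilePath))
                {
                    // Second overload: CopyTo(Stream destination, int bufferSize).
                    inputStream.CopyTo(outputStream, bufferSize);
                }
            }
        }
    }
}
```

Because the copy is binary, nothing is inserted between files; if each input must end with a line break, that has to be written explicitly.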
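You can do several things:
In my experience the default buffer size can be increased to about 120K with noticeable benefit. I suspect that setting a large buffer on all streams will be the easiest and most noticeable performance boost:
new System.IO.FileStream("File.txt", System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read, 150000);
Use the Stream class, not the StreamReader class.
Rely on the using statement: the explicit Close() and Dispose() calls inside it are redundant.
One option is to use the copy command and let it do what it is good at.
Something like: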
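static void MultipleFilesToSingleFile(string dirPath, string filePattern, string destFile)
{
    // /b forces a binary copy; quoting protects paths that contain spaces.
    // destFile should not match filePattern inside dirPath, or copy would read its own output.
    var cmd = new ProcessStartInfo("cmd.exe",
        String.Format("/c copy /b \"{0}\" \"{1}\"", filePattern, destFile));
    cmd.WorkingDirectory = dirPath;
    cmd.UseShellExecute = false;
    using (var process = Process.Start(cmd))
    {
        // Wait until the copy has finished before returning to the caller.
        process.WaitForExit();
    }
}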
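I would use a BlockingCollection for the reads so that you can read and write concurrently.
Clearly you should write to a separate physical disk to avoid hardware contention. This code preserves order.
Reading will be faster than writing, so there is no need to read in parallel.
Again, because reading is faster, the size of the collection is limited so that reading does not get further ahead of writing than it needs to.
A simple task that reads the single next file in parallel while the current one is being written has a problem when file sizes differ: writing a small file is faster than reading a big one.
I use this pattern to read and parse text on T1 and then insert into SQL on T2.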
public void WriteFiles()
{
    using (BlockingCollection<string> bc = new BlockingCollection<string>(10))
    {
        // play with 10 if you have several small files then a big file
        // write can get ahead of read if not enough are queued
        TextWriter tw = new StreamWriter(@"c:\temp\alltext.text", true);
        // clearly you want to write to a different physical disk
        // ideally write to solid state even if you move the files to regular disk when done
        // Spin up a Task to populate the BlockingCollection
        using (Task t1 = Task.Factory.StartNew(() =>
        {
            string dir = @"c:\temp\";
            string fileText;
            int minSize = 100000; // play with this
            StringBuilder sb = new StringBuilder(minSize);
            string[] fileAry = Directory.GetFiles(dir, @"*.txt");
            foreach (string fi in fileAry)
            {
                Debug.WriteLine("Add " + fi);
                fileText = File.ReadAllText(fi);
                //bc.Add(fi); for testing just add filepath
                if (fileText.Length > minSize)
                {
                    // flush the pending small files first so order is preserved
                    if (sb.Length > 0)
                    {
                        bc.Add(sb.ToString());
                        sb.Clear();
                    }
                    bc.Add(fileText); // could be really big so don't hit sb
                }
                else
                {
                    // batch small files together until they reach minSize
                    sb.Append(fileText);
                    if (sb.Length > minSize)
                    {
                        bc.Add(sb.ToString());
                        sb.Clear();
                    }
                }
            }
            if (sb.Length > 0)
            {
                bc.Add(sb.ToString());
                sb.Clear();
            }
            bc.CompleteAdding();
        }))
        {
            // Spin up a Task to consume the BlockingCollection
            using (Task t2 = Task.Factory.StartNew(() =>
            {
                string text;
                try
                {
                    while (true)
                    {
                        text = bc.Take();
                        Debug.WriteLine("Take " + text);
                        tw.WriteLine(text);
                    }
                }
                catch (InvalidOperationException)
                {
                    // An InvalidOperationException means that Take() was called on a completed collection
                    Debug.WriteLine("That's All!");
                    tw.Close();
                    tw.Dispose();
                }
            }))
                Task.WaitAll(t1, t2);
        }
    }
}
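The same producer/consumer idea can be sketched with GetConsumingEnumerable, which ends the consumer loop once CompleteAdding has been called instead of relying on an InvalidOperationException. This is a simplified variant under stated assumptions, not the code above: the names PipelinedCombiner and Combine are hypothetical, it queues raw bytes rather than text, it does not batch small files, and it writes no line break between files.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public static class PipelinedCombiner
{
    // Hypothetical helper: one task reads whole files, another writes them in the same order.
    // maxQueued bounds how many file contents may wait in memory at once.
    public static void Combine(string[] inputFilePaths, string outputFilePath, int maxQueued = 10)
    {
        using (var queue = new BlockingCollection<byte[]>(maxQueued))
        using (var output = File.Create(outputFilePath))
        {
            var reader = Task.Run(() =>
            {
                try
                {
                    foreach (var path in inputFilePaths)
                    {
                        // Add blocks while the queue already holds maxQueued items.
                        queue.Add(File.ReadAllBytes(path));
                    }
                }
                finally
                {
                    // Always signal completion so the writer loop can finish.
                    queue.CompleteAdding();
                }
            });

            var writer = Task.Run(() =>
            {
                // Yields items in FIFO order and stops after CompleteAdding.
                foreach (var chunk in queue.GetConsumingEnumerable())
                {
                    output.Write(chunk, 0, chunk.Length);
                }
            });

            Task.WaitAll(reader, writer);
        }
    }
}
```

With a single producer and a single consumer the default queue keeps the files in order. Each file is still read into memory as a whole, so very large inputs would need to be split into smaller chunks before queuing.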
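I tried the solution posted by sergey-brunov to merge 2 GB of files. The system used about 2 GB of RAM for this job. I made some changes to optimize it further, and it now needs 350 MB of RAM to merge 2 GB of files.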
private static void CombineMultipleFilesIntoSingleFile(string inputDirectoryPath, string inputFileNamePattern, string outputFilePath)
{
    string[] inputFilePaths = Directory.GetFiles(inputDirectoryPath, inputFileNamePattern);
    Console.WriteLine("Number of files: {0}.", inputFilePaths.Length);
    foreach (var inputFilePath in inputFilePaths)
    {
        // File.AppendText appends to outputFilePath; the file is reopened for every input.
        using (var outputStream = File.AppendText(outputFilePath))
        {
            // File.ReadAllText loads one whole input file into memory at a time.
            outputStream.WriteLine(File.ReadAllText(inputFilePath));
            Console.WriteLine("The file {0} has been processed.", inputFilePath);
        }
    }
}
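If memory use is the main concern, a text merge can also be sketched with a fixed-size character buffer so that no input file is ever held in memory as a whole. The names BufferedTextCombiner and Combine and the 64 K character buffer are assumptions for illustration; like the code above, the sketch writes a line break after each file, but it overwrites the output file rather than appending to it.

```csharp
using System;
using System.IO;

public static class BufferedTextCombiner
{
    // Hypothetical helper: copies text in fixed-size blocks, so memory use stays
    // close to the buffer size regardless of how large the inputs are.
    public static void Combine(string[] inputFilePaths, string outputFilePath)
    {
        var buffer = new char[64 * 1024]; // assumed size, tune by measurement
        using (var writer = new StreamWriter(outputFilePath, append: false))
        {
            foreach (var inputFilePath in inputFilePaths)
            {
                using (var reader = new StreamReader(inputFilePath))
                {
                    int read;
                    // Read returns 0 at the end of the file.
                    while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
                    {
                        writer.Write(buffer, 0, read);
                    }
                }
                // Separate files with a line break, as WriteLine(ReadAllText(...)) does.
                writer.WriteLine();
            }
        }
    }
}
```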
// Binary File Copy
public static void mergeFiles(string strFileIn1, string strFileIn2, string strFileOut, out string strError)
{
    strError = String.Empty;
    try
    {
        using (FileStream streamIn1 = File.OpenRead(strFileIn1))
        using (FileStream streamIn2 = File.OpenRead(strFileIn2))
        // File.Create truncates an existing output file; File.OpenWrite would
        // leave stale bytes behind if the old file was longer than the new one.
        using (FileStream writeStream = File.Create(strFileOut))
        {
            // create a buffer to hold the bytes. Might be bigger.
            byte[] buffer = new Byte[1024];
            int bytesRead;
            // while the read method returns bytes keep writing them to the output stream
            while ((bytesRead = streamIn1.Read(buffer, 0, buffer.Length)) > 0)
            {
                writeStream.Write(buffer, 0, bytesRead);
            }
            // then append the second file the same way
            while ((bytesRead = streamIn2.Read(buffer, 0, buffer.Length)) > 0)
            {
                writeStream.Write(buffer, 0, bytesRead);
            }
        }
    }
    catch (Exception ex)
    {
        strError = ex.Message;
    }
}