Combine multiple files into a single file

Code:

static void MultipleFilesToSingleFile(string dirPath, string filePattern, string destFile)
{
    string[] fileAry = Directory.GetFiles(dirPath, filePattern);

    Console.WriteLine("Total File Count : " + fileAry.Length);

    using (TextWriter tw = new StreamWriter(destFile, true))
    {
        foreach (string filePath in fileAry)
        {
            using (TextReader tr = new StreamReader(filePath))
            {
                tw.WriteLine(tr.ReadToEnd());
                tr.Close();
                tr.Dispose();
            }
            Console.WriteLine("File Processed : " + filePath);
        }

        tw.Close();
        tw.Dispose();
    }
}

I need to optimize this as it's extremely slow: it takes 3 minutes to combine 45 XML files averaging 40–50 MB each.

Please note: 45 files averaging 45 MB each is just one example; it can be n files of size m, where n may be in the thousands and m may average 128 KB. In short, it can vary.

Could you please provide any views on optimization?

General answer

Why not just use the Stream.CopyTo(Stream destination) method?

private static void CombineMultipleFilesIntoSingleFile(string inputDirectoryPath, string inputFileNamePattern, string outputFilePath)
{
    string[] inputFilePaths = Directory.GetFiles(inputDirectoryPath, inputFileNamePattern);
    Console.WriteLine("Number of files: {0}.", inputFilePaths.Length);
    using (var outputStream = File.Create(outputFilePath))
    {
        foreach (var inputFilePath in inputFilePaths)
        {
            using (var inputStream = File.OpenRead(inputFilePath))
            {
                // Buffer size can be passed as the second argument.
                inputStream.CopyTo(outputStream);
            }
            Console.WriteLine("The file {0} has been processed.", inputFilePath);
        }
    }
}

Buffer size adjustment

Please note that the method is overloaded.

There are two method overloads:

  1. CopyTo(Stream destination).
  2. CopyTo(Stream destination, int bufferSize).

The second overload allows adjusting the buffer size through the bufferSize parameter.
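For illustration, a minimal sketch of the two overloads side by side (the 1 MB figure is only an assumed example; the default copy buffer in recent .NET versions is 81920 bytes):

inputStream.CopyTo(outputStream);              // default buffer (81920 bytes on recent .NET)
inputStream.CopyTo(outputStream, 1024 * 1024); // assumed 1 MB buffer; benchmark for your workload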

Several things you can do:

  • In my experience the default buffer sizes can be increased with noticeable benefit up to about 120K; I suspect setting a large buffer on all streams will be the easiest and most noticeable performance booster (see the combined sketch after this list):

     new System.IO.FileStream("File.txt", System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read, 150000);
  • Use the Stream class, not the StreamReader class.

  • Read contents into a large buffer and dump them into the output stream at once; this will speed up operations on small files.
  • There is no need for the redundant Close/Dispose calls: you have the using statement.
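A minimal sketch combining these tips, assuming the 150000-byte buffer from the first bullet (a starting point to benchmark, not a recommendation; the method name is hypothetical):

private static void CombineWithLargeBuffers(string inputDirectoryPath, string inputFileNamePattern, string outputFilePath)
{
    const int bufferSize = 150000; // assumed figure from the bullet above; tune for your hardware
    using (var outputStream = new FileStream(outputFilePath, FileMode.Create, FileAccess.Write, FileShare.None, bufferSize))
    {
        foreach (var inputFilePath in Directory.GetFiles(inputDirectoryPath, inputFileNamePattern))
        {
            using (var inputStream = new FileStream(inputFilePath, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize))
            {
                // Raw streams, no StreamReader; one large buffered copy per file.
                inputStream.CopyTo(outputStream, bufferSize);
            }
        }
    }
}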

One option is to utilize the copy command and let it do what it does well.

Something like:

static void MultipleFilesToSingleFile(string dirPath, string filePattern, string destFile)
{
    // /b forces binary mode so copy does not append a Ctrl+Z end-of-file character.
    var cmd = new ProcessStartInfo("cmd.exe", 
        String.Format("/c copy /b {0} {1}", filePattern, destFile));
    cmd.WorkingDirectory = dirPath;
    cmd.UseShellExecute = false;
    using (var process = Process.Start(cmd))
    {
        process.WaitForExit(); // block until the copy has actually finished
    }
}
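For reference, the command this runs is equivalent to typing the following in the working directory (file names here are assumed examples); when the source is a wildcard or a file1+file2 list, copy concatenates all matches into the destination:

copy /b *.xml combined.xml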

I would use a BlockingCollection so you can read and write concurrently.
You should clearly write to a separate physical disk to avoid hardware contention. This code will preserve order.
Reading is going to be faster than writing, so there is no need for parallel reads.
Again, since reading is faster, limit the size of the collection so the reader does not get farther ahead of the writer than it needs to.
A simpler approach of reading the next file in parallel while writing the current one runs into the problem of differing file sizes: writing a small file is faster than reading a big one.

I use this pattern to read and parse text on T1 and then insert to SQL on T2.

public void WriteFiles()
{
    using (BlockingCollection<string> bc = new BlockingCollection<string>(10))
    {
        // play with 10 if you have several small files followed by a big file;
        // write can get ahead of read if not enough are queued

        TextWriter tw = new StreamWriter(@"c:\temp\alltext.text", true);
        // clearly you want to write to a different physical disk;
        // ideally write to solid state, even if you move the files to a regular disk when done
        // Spin up a Task to populate the BlockingCollection
        using (Task t1 = Task.Factory.StartNew(() =>
        {
            string dir = @"c:\temp\";
            string fileText;      
            int minSize = 100000; // play with this
            StringBuilder sb = new StringBuilder(minSize);
            string[] fileAry = Directory.GetFiles(dir, @"*.txt");
            foreach (string fi in fileAry)
            {
                Debug.WriteLine("Add " + fi);
                fileText = File.ReadAllText(fi);
                //bc.Add(fi);  for testing just add filepath
                if (fileText.Length > minSize)
                {
                    if (sb.Length > 0)
                    { 
                       bc.Add(sb.ToString());
                       sb.Clear();
                    }
                    bc.Add(fileText);  // could be really big so don't hit sb
                }
                else
                {
                    sb.Append(fileText);
                    if (sb.Length > minSize)
                    {
                        bc.Add(sb.ToString());
                        sb.Clear();
                    }
                }
            }
            if (sb.Length > 0)
            {
                bc.Add(sb.ToString());
                sb.Clear();
            }
            bc.CompleteAdding();
        }))
        {

            // Spin up a Task to consume the BlockingCollection
            using (Task t2 = Task.Factory.StartNew(() =>
            {
                string text;
                try
                {
                    while (true)
                    {
                        text = bc.Take();
                        Debug.WriteLine("Take " + text);
                        tw.WriteLine(text);                  
                    }
                }
                catch (InvalidOperationException)
                {
                    // An InvalidOperationException means that Take() was called on a completed collection
                    Debug.WriteLine("That's All!");
                    tw.Close();
                    tw.Dispose();
                }
            }))

                Task.WaitAll(t1, t2);
        }
    }
}

BlockingCollection Class

I tried the solution posted by sergey-brunov for merging a 2 GB file. The system took around 2 GB of RAM for this work. I have made some changes for further optimization, and it now takes 350 MB of RAM to merge a 2 GB file.

private static void CombineMultipleFilesIntoSingleFile(string inputDirectoryPath, string inputFileNamePattern, string outputFilePath)
{
    string[] inputFilePaths = Directory.GetFiles(inputDirectoryPath, inputFileNamePattern);
    Console.WriteLine("Number of files: {0}.", inputFilePaths.Length);
    foreach (var inputFilePath in inputFilePaths)
    {
        // Open the output in append mode per file so only one input file
        // is held in memory at a time.
        using (var outputStream = File.AppendText(outputFilePath))
        {
            outputStream.WriteLine(File.ReadAllText(inputFilePath));
            Console.WriteLine("The file {0} has been processed.", inputFilePath);
        }
    }
}
// Binary File Copy
public static void mergeFiles(string strFileIn1, string strFileIn2, string strFileOut, out string strError)
{
    strError = String.Empty;
    try
    {
        using (FileStream streamIn1 = File.OpenRead(strFileIn1))
        using (FileStream streamIn2 = File.OpenRead(strFileIn2))
        using (FileStream writeStream = File.OpenWrite(strFileOut))
        {
            // Create a buffer to hold the bytes. It might be bigger.
            byte[] buffer = new byte[1024];
            int bytesRead;

            // While the Read method returns bytes, keep writing them to the output stream.
            while ((bytesRead = streamIn1.Read(buffer, 0, buffer.Length)) > 0)
            {
                writeStream.Write(buffer, 0, bytesRead);
            }
            while ((bytesRead = streamIn2.Read(buffer, 0, buffer.Length)) > 0)
            {
                writeStream.Write(buffer, 0, bytesRead);
            }
        }
    }
    catch (Exception ex)
    {
        strError = ex.Message;
    }
}
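A usage sketch with assumed paths (the file names are hypothetical):

string error;
mergeFiles(@"c:\temp\part1.xml", @"c:\temp\part2.xml", @"c:\temp\merged.xml", out error);
if (error != String.Empty)
{
    Console.WriteLine("Merge failed: " + error);
}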
